
Shipping AI Agents in Healthcare Diagnostics: What Actually Breaks

Dan Hartman, Editor · 7 min read

Shipping AI agents in healthcare diagnostics is hard. Learn how we built a system to triage radiology reports, what broke, and why LangSmith was essential for debugging and compliance.

The Problem: Drowning in Data, Not Diagnoses

Last month, a friend running a small radiology practice in rural Oregon called me, exasperated. They’re drowning in imaging reports. Not complex cases, but the sheer volume of routine scans – chest X-rays, basic MRIs – means their specialists spend hours sifting through normal findings just to find the one or two that need immediate attention. The delay isn’t just an efficiency problem; it’s a patient care issue. This isn’t about replacing radiologists; it’s about giving them back their time to focus on the hard stuff. My friend needed a way to pre-process these reports, flag anything remotely unusual, and present a concise summary to the human expert. That’s where I saw a clear application for AI agents in healthcare diagnostics.

Building the Agent Workflow: LangGraph and RAG in Practice

We started sketching out an agent workflow. The core idea was a multi-agent system: one agent to ingest the raw text reports (often PDFs converted to text, which, yes, is annoying to clean and often requires OCR), another to extract key findings and medical entities, and a third to cross-reference these findings against a knowledge base of common pathologies and normal ranges. We considered a few frameworks. LangChain’s older AgentExecutor felt too brittle for the kind of sequential, conditional logic we needed, especially when dealing with the nuances of medical terminology. CrewAI was tempting for its collaborative agent model, but we ultimately settled on LangGraph. Its state machine approach gave us the explicit control over agent turns and tool calls that healthcare data demands. We couldn’t afford silent failures or agents looping indefinitely on sensitive patient data; every step needed to be predictable and auditable.
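To make the "explicit control" point concrete, here is a framework-free sketch of the state-machine idea in plain Python. It does not use the LangGraph API; the node names (ingest, extract, cross_reference), the MAX_STEPS cap, and the NORMAL_PHRASES set are hypothetical stand-ins for illustration only.

```python
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]
Node = Callable[[State], State]

MAX_STEPS = 10  # hypothetical cap so a routing bug cannot loop forever
NORMAL_PHRASES = {"no acute findings"}  # hypothetical stand-in for the knowledge base


def ingest(state: State) -> State:
    # Collapse whitespace in the raw report text.
    return {**state, "text": " ".join(str(state["raw_text"]).split())}


def extract(state: State) -> State:
    # Naive sentence split; a real system would call an LLM here.
    findings = [s.strip().lower() for s in state["text"].split(".") if s.strip()]
    return {**state, "findings": findings}


def cross_reference(state: State) -> State:
    # Keep only findings that are not on the known-normal list.
    flagged = [f for f in state["findings"] if f not in NORMAL_PHRASES]
    return {**state, "flagged": flagged}


NODES: Dict[str, Node] = {
    "ingest": ingest,
    "extract": extract,
    "cross_reference": cross_reference,
}


def router(current: str, state: State) -> Optional[str]:
    # Explicit, conditional transitions: stop early if ingestion produced nothing.
    if current == "ingest":
        return "extract" if state["text"] else None
    if current == "extract":
        return "cross_reference"
    return None


def run_graph(nodes: Dict[str, Node],
              route: Callable[[str, State], Optional[str]],
              start: str,
              state: State) -> State:
    """Run one node at a time and record every transition for auditing."""
    current: Optional[str] = start
    trace: List[str] = []
    while current is not None:
        if len(trace) >= MAX_STEPS:
            raise RuntimeError("step limit exceeded; possible routing loop")
        state = nodes[current](state)
        trace.append(current)
        current = route(current, state)
    return {**state, "trace": trace}
```

Calling run_graph(NODES, router, "ingest", {"raw_text": "..."}) returns the final state plus a trace of the nodes visited, and a router that never terminates raises instead of spinning on patient data.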

The ingestion agent, let’s call it ‘ReportReader,’ used a simple LLM call to parse the text, identifying patient demographics, scan type, and the radiologist’s impression. Its primary tool was a custom Python function that took the raw text and returned structured JSON, adhering to a predefined schema for medical reports. This wasn’t fancy; it just made the data usable for the next step, ensuring consistency. We also had to build in checks for common data formats like DICOM metadata, even if the primary input was text.
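A minimal sketch of that schema check might look like the following. The field names (patient_id, scan_type, impression) are hypothetical; the actual schema is not shown in this article.

```python
import json
from dataclasses import dataclass

# Hypothetical required fields; the real schema is not published here.
REQUIRED_FIELDS = ("patient_id", "scan_type", "impression")


@dataclass(frozen=True)
class ParsedReport:
    patient_id: str
    scan_type: str
    impression: str


def parse_model_output(raw_json: str) -> ParsedReport:
    """Validate the LLM's JSON before any downstream agent sees it."""
    data = json.loads(raw_json)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Treat absent, null, and blank values as missing.
    missing = [f for f in REQUIRED_FIELDS if not str(data.get(f) or "").strip()]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return ParsedReport(**{f: str(data[f]).strip() for f in REQUIRED_FIELDS})
```

Failing loudly at this boundary is the point: a report that cannot be parsed into the schema should stop the pipeline rather than reach the next agent half-formed.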

Next came ‘AnomalyDetector.’ This agent’s job was more complex. It took the structured JSON from ReportReader and used another LLM call, augmented with a few RAG tools. One RAG tool queried a local vector database of anonymized historical reports, looking for similar cases or patterns that might indicate an unusual finding. This database was meticulously curated and regularly updated by medical professionals. Another RAG tool accessed a curated database of diagnostic criteria and normal physiological ranges, pulling from established medical texts and guidelines. The prompt for AnomalyDetector was critical: ‘Given this patient’s report and relevant medical context, identify any findings that deviate from normal or suggest a potential pathology. Output a confidence score (0-100) and a brief, evidence-based justification, citing the specific finding and the relevant medical context.’ This agent wasn’t making a diagnosis, but it was highlighting areas for human review with supporting evidence.
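For readers new to RAG, the retrieval half boils down to a nearest-neighbor lookup. The sketch below uses toy, hand-written three-dimensional vectors and hypothetical document IDs; a real deployment would use an embedding model and a vector database, neither of which is shown here.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          index: Sequence[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k most similar (doc_id, score) pairs, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: IDs and vectors are made up for illustration.
INDEX = [
    ("normal-chest", [1.0, 0.0, 0.0]),
    ("cardiomegaly", [0.0, 1.0, 0.0]),
    ("effusion", [0.0, 0.6, 0.8]),
]

matches = top_k([0.0, 0.9, 0.1], INDEX, k=2)
```

The retrieved entries are then pasted into the prompt as the "relevant medical context," which is why the quality of the index matters as much as the prompt itself.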

The final agent, ‘SummaryGenerator,’ took the output from AnomalyDetector – the flagged findings, confidence scores, and justifications – and synthesized a concise summary for the human radiologist. It also had a tool to generate a preliminary ‘action required’ flag (e.g., ‘Urgent Review,’ ‘Routine Follow-up’) based on the confidence score and the nature of the anomaly. This summary was then pushed to a secure internal dashboard, integrated with the clinic’s existing EMR system.
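The flagging rule can be sketched as a small pure function. The threshold of 80 and the list of critical terms below are hypothetical placeholders, not values from the clinic's system and not clinical guidance.

```python
URGENT_THRESHOLD = 80  # hypothetical cutoff on the 0-100 confidence score
CRITICAL_TERMS = ("pulmonary embolism", "pneumothorax")  # hypothetical examples


def action_flag(confidence: int, finding: str) -> str:
    """Map a confidence score and finding text to an 'action required' label."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    text = finding.lower()
    # The nature of the anomaly can escalate a flag regardless of score.
    is_critical = any(term in text for term in CRITICAL_TERMS)
    if is_critical or confidence >= URGENT_THRESHOLD:
        return "Urgent Review"
    return "Routine Follow-up"
```

Keeping this rule outside the LLM makes it deterministic and easy to audit: the same score and finding always produce the same flag.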

We used LangSmith extensively for tracing and debugging. Honestly, without LangSmith, this project would have been a nightmare. The ability to see every LLM call, every tool invocation, and the exact prompt/response pairs was invaluable. When AnomalyDetector started hallucinating ‘pulmonary embolisms’ on every other chest X-ray, LangSmith showed us exactly which RAG retrieval was failing or which part of the prompt was ambiguous. It’s not cheap, but for production systems, it’s non-negotiable. I think the $150/month for their team plan is fair, considering the time it saves in debugging and the critical nature of the data.

When Agents Fail: Debugging and Compliance in Healthcare

The initial deployment was a mess. We had a few major issues. First, the PDF-to-text conversion was inconsistent. Scanned documents, especially older ones, often produced garbled text, leading ReportReader to output nonsense. Imagine a report where ‘no acute findings’ became ‘no acutefindings’ or ‘mild cardiomegaly’ was parsed as ‘mild card iomegaly.’ These subtle errors, easy to miss when skimming a report, completely threw off the downstream agents. We had to add a pre-processing step that ran each document through an OCR service (Google Cloud Vision API, specifically) before the text reached ReportReader. This added cost and complexity, but it was essential for data quality. We also found that certain medical abbreviations or shorthand, common in older reports, were consistently misinterpreted, requiring a custom dictionary and a pre-LLM normalization step.
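A pre-LLM normalization step of this kind can be sketched with the standard library alone. The two OCR fixes come from the examples above; the abbreviation entries are hypothetical samples, not the project's actual dictionary.

```python
import re

# OCR repairs taken from the examples in this article.
OCR_FIXES = {
    "acutefindings": "acute findings",
    "card iomegaly": "cardiomegaly",
}
# Hypothetical sample entries; a real dictionary would be clinician-reviewed.
ABBREVIATIONS = {
    "cxr": "chest x-ray",
    "r/o": "rule out",
    "hx": "history",
}

# Match abbreviations only as standalone tokens, not inside longer words.
_ABBREV_PATTERN = re.compile(
    r"(?<![\w/])(" + "|".join(re.escape(k) for k in ABBREVIATIONS) + r")(?![\w/])",
    flags=re.IGNORECASE,
)


def normalize(text: str) -> str:
    """Collapse whitespace, repair known OCR splits, expand abbreviations."""
    text = re.sub(r"\s+", " ", text).strip()
    for bad, good in OCR_FIXES.items():
        text = re.sub(re.escape(bad), good, text, flags=re.IGNORECASE)
    return _ABBREV_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)
```

Because the rules are plain data, a clinician can review and extend them without touching the agent code.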

Second, the AnomalyDetector agent, despite our careful prompting, sometimes got ‘creative’ with its justifications. It would invent plausible-sounding medical reasons for flagging something that was actually normal. For instance, it once flagged a perfectly healthy lung field as having ‘diffuse interstitial thickening’ because the prompt mentioned ‘interstitial’ in a different context, and the LLM latched onto it, fabricating a finding. This was a huge compliance risk and a waste of the human radiologist’s time. We tightened the RAG sources, making sure the knowledge base was strictly factual and peer-reviewed, pulling from sources like UpToDate or specific clinical guidelines, not just general medical texts. We also added a ‘critique’ step within the LangGraph flow: a small, separate LLM call that reviewed AnomalyDetector’s output for logical consistency and adherence to medical guidelines before passing it to SummaryGenerator. This ‘self-correction’ mechanism added latency but drastically reduced false positives. It wasn’t perfect, but it caught the most egregious errors.
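The critique step described above was an LLM call. As a complement to it, not a reproduction of it, a cheap deterministic grounding check can reject any flagged finding whose cited phrase never appears in the source report. The "cited" key and the function names below are hypothetical.

```python
from typing import Dict, List, Tuple


def _squash(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences do not matter.
    return " ".join(text.lower().split())


def is_grounded(cited_finding: str, report_text: str) -> bool:
    """True only if the cited phrase literally occurs in the report."""
    cited = _squash(cited_finding)
    return bool(cited) and cited in _squash(report_text)


def grounding_filter(findings: List[Dict[str, str]],
                     report_text: str) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split findings into (kept, rejected) based on the grounding check."""
    kept: List[Dict[str, str]] = []
    rejected: List[Dict[str, str]] = []
    for finding in findings:
        target = kept if is_grounded(finding.get("cited", ""), report_text) else rejected
        target.append(finding)
    return kept, rejected
```

A substring match is deliberately strict and will miss paraphrases, so it suits a first-pass filter before a slower LLM critique rather than a replacement for one.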

Third, managing access and audit trails was a constant headache. Since we were dealing with Protected Health Information (PHI), every interaction, every decision, every data point touched by an agent needed to be logged and auditable. This wasn’t just about ‘saving logs’; it was about proving chain of custody and ensuring data integrity. We built a custom logging service that captured agent states, tool calls, and LLM inputs/outputs, pushing everything to an immutable ledger. This wasn’t something LangGraph or CrewAI provided out-of-the-box; it was a significant engineering effort. We had to consider data retention policies, encryption at rest and in transit, and granular access controls. We also had to integrate with the clinic’s existing identity management system for tool access, which meant custom connectors and strict API key management. The governance overhead for healthcare agents is immense, and anyone telling you otherwise hasn’t shipped one. We spent weeks just getting the data flow compliant with HIPAA regulations, which meant careful data anonymization for training and strict access controls for production data.
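One common way to make a log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is an illustration of that general technique under stated assumptions, not the custom logging service described above; the entry layout and function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Append a payload, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev_hash": prev_hash, "payload": payload,
             "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

In practice the payload would carry references such as a report ID, agent name, and step, rather than raw PHI, and the chain would live in append-only storage with its own access controls.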

The Real Win: Augmenting Expertise, Not Replacing It

Despite the hurdles, the system delivered. The biggest win was the reduction in human review time for routine cases. Radiologists now receive a pre-digested summary, with potential anomalies clearly highlighted. They can quickly scan the agent’s output, review the original report, and then make their final determination. This isn’t about automation; it’s about augmentation. My friend’s practice saw a 30% reduction in the time spent on initial report triage within three months. That’s more time for complex cases, more time for patient consultations, and frankly, less burnout for the specialists.

One specific feature I loved was the confidence score from AnomalyDetector. It wasn’t perfect, but it gave the human a quick heuristic. A low confidence score on a flagged item meant ‘double-check this carefully, agent might be guessing.’ A high score meant ‘this is likely real, confirm it.’ It built trust, which is paramount in healthcare. We also found Bardeen.ai surprisingly useful for some of the peripheral tasks, like automatically creating follow-up tasks in their practice management system when an ‘Urgent Review’ flag was raised. It’s a low-code platform, but for connecting the agent’s output to existing clinic software, it saved us a ton of custom API work. I wouldn’t use it for core agent logic, but for the ‘last mile’ integration, it’s pretty handy.


The real value of AI agents in healthcare diagnostics isn’t in replacing human expertise, but in refining it. It’s about offloading the mundane, repetitive tasks that drain specialists’ time and attention. It’s about creating a force multiplier for highly skilled professionals. The free tier of most agent frameworks is enough for solo work, but if you’re building anything that touches real patient data, you’ll need the enterprise features, the dedicated support, and the robust logging that come with paid plans. Don’t skimp on observability or security; the cost of a data breach or a misdiagnosis far outweighs any savings on tooling. This isn’t a ‘set it and forget it’ technology. It requires constant monitoring, refinement, and a deep understanding of both the medical domain and the agent’s limitations. But when done right, it makes a tangible difference.
