The Problem: Drowning in Data, Not Diagnoses
Last month, a friend who runs a small radiology practice in rural Oregon called me, exasperated. The practice is drowning in imaging reports. The problem isn’t complex cases; it’s the sheer volume of routine scans (chest X-rays, basic MRIs), which means the specialists spend hours sifting through normal findings just to find the one or two that need immediate attention. The delay isn’t just an efficiency problem; it’s a patient care issue. This isn’t about replacing radiologists; it’s about giving them back their time to focus on the hard stuff. My friend needed a way to pre-process these reports, flag anything remotely unusual, and present a concise summary to the human expert. That’s where I saw a clear application for AI agents in healthcare diagnostics.
Building the Agent Workflow: LangGraph and RAG in Practice
We started sketching out an agent workflow. The core idea was a multi-agent system: one agent to ingest the raw text reports (often PDFs converted to text, which, yes, is annoying to clean and often requires OCR), another to extract key findings and medical entities, and a third to cross-reference these findings against a knowledge base of common pathologies and normal ranges. We considered a few frameworks. LangChain’s older AgentExecutor felt too brittle for the kind of sequential, conditional logic we needed, especially when dealing with the nuances of medical terminology. CrewAI was tempting for its collaborative agent model, but we ultimately settled on LangGraph. Its state machine approach gave us the explicit control over agent turns and tool calls that healthcare data demands. We couldn’t afford silent failures or agents looping indefinitely on sensitive patient data; every step needed to be predictable and auditable.
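To make the "explicit control over agent turns" idea concrete, here is a minimal, library-free sketch of a state machine with a hard step cap and a recorded trace. It does not use LangGraph's API; the node names, the routing rules, and the `max_steps` parameter are assumptions made for illustration, and the node bodies are placeholders where LLM calls would go.

```python
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]

def report_reader(state: State) -> State:
    # Placeholder for the ingestion agent; a real node would call an LLM.
    return {**state, "structured": {"impression": state["raw_text"].strip()}}

def anomaly_detector(state: State) -> State:
    # Placeholder rule: anything other than "no acute findings" gets flagged.
    impression = state["structured"]["impression"].lower()
    findings = [] if "no acute findings" in impression else [impression]
    return {**state, "findings": findings}

def summary_generator(state: State) -> State:
    count = len(state["findings"])
    return {**state, "summary": f"{count} finding(s) flagged for review"}

def manual_review(state: State) -> State:
    # Hypothetical fallback node for reports that could not be parsed.
    return {**state, "summary": "Unparseable report; route to a human"}

NODES: Dict[str, Callable[[State], State]] = {
    "report_reader": report_reader,
    "anomaly_detector": anomaly_detector,
    "summary_generator": summary_generator,
    "manual_review": manual_review,
}

def next_node(current: str, state: State) -> Optional[str]:
    # Every transition is spelled out, so there is no implicit looping.
    if current == "report_reader":
        has_text = bool(state["structured"]["impression"])
        return "anomaly_detector" if has_text else "manual_review"
    if current == "anomaly_detector":
        return "summary_generator"
    return None  # summary_generator and manual_review are terminal

def run_workflow(raw_text: str, max_steps: int = 10) -> State:
    state: State = {"raw_text": raw_text, "trace": []}
    current: Optional[str] = "report_reader"
    steps = 0
    while current is not None:
        if steps >= max_steps:
            # Fail loudly instead of letting an agent loop on patient data.
            raise RuntimeError(f"step limit {max_steps} reached at {current}")
        state = NODES[current](state)
        state["trace"] = state["trace"] + [current]
        current = next_node(current, state)
        steps += 1
    return state
```

The `trace` list is the part that matters for auditability: after a run it holds the exact sequence of nodes that touched the report.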
The ingestion agent, let’s call it ‘ReportReader,’ used a simple LLM call to parse the text, identifying patient demographics, scan type, and the radiologist’s impression. Its primary tool was a custom Python function that took the raw text and returned structured JSON, adhering to a predefined schema for medical reports. This wasn’t fancy; it just made the data usable for the next step, ensuring consistency. We also had to build in checks for common data formats like DICOM metadata, even if the primary input was text.
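As a rough illustration of the "predefined schema" step, the sketch below validates the JSON an LLM returns before anything downstream sees it. The field names (`patient_age`, `patient_sex`, `scan_type`, `impression`) and the age bounds are assumptions for the example, not the schema we actually shipped.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ParsedReport:
    patient_age: Optional[int]
    patient_sex: Optional[str]
    scan_type: str
    impression: str

# Hypothetical schema: these two fields must be non-empty strings.
REQUIRED_TEXT_FIELDS = ("scan_type", "impression")

def parse_report_json(llm_output: str) -> ParsedReport:
    """Validate the LLM's JSON and return a typed record, or raise ValueError."""
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    for field in REQUIRED_TEXT_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty required field: {field}")

    age = data.get("patient_age")
    if age is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(age, bool) or not isinstance(age, int) or not 0 <= age <= 130:
            raise ValueError(f"implausible patient_age: {age!r}")

    sex = data.get("patient_sex")
    if sex is not None and not isinstance(sex, str):
        raise ValueError(f"patient_sex must be a string, got {sex!r}")

    return ParsedReport(
        patient_age=age,
        patient_sex=sex,
        scan_type=data["scan_type"].strip(),
        impression=data["impression"].strip(),
    )
```

Raising on a bad payload, rather than passing along a partially filled record, keeps a garbled report from quietly reaching the next agent.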
Next came ‘AnomalyDetector.’ This agent’s job was more complex. It took the structured JSON from ReportReader and used another LLM call, augmented with a few RAG tools. One RAG tool queried a local vector database of anonymized historical reports, looking for similar cases or patterns that might indicate an unusual finding. This database was meticulously curated and regularly updated by medical professionals. Another RAG tool accessed a curated database of diagnostic criteria and normal physiological ranges, pulling from established medical texts and guidelines. The prompt for AnomalyDetector was critical: ‘Given this patient’s report and relevant medical context, identify any findings that deviate from normal or suggest a potential pathology. Output a confidence score (0-100) and a brief, evidence-based justification, citing the specific finding and the relevant medical context.’ This agent wasn’t making a diagnosis, but it was highlighting areas for human review with supporting evidence.
The final agent, ‘SummaryGenerator,’ took the output from AnomalyDetector – the flagged findings, confidence scores, and justifications – and synthesized a concise summary for the human radiologist. It also had a tool to generate a preliminary ‘action required’ flag (e.g., ‘Urgent Review,’ ‘Routine Follow-up’) based on the confidence score and the nature of the anomaly. This summary was then pushed to a secure internal dashboard, integrated with the clinic’s existing EMR system.
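The ‘action required’ tool boils down to a small mapping from confidence and anomaly type to a label. The sketch below is one way to write it; the thresholds (80 and 40), the `time_sensitive` field, and the ‘No Flag’ label are invented for illustration, and real cut-offs would need to be set with the radiologists.

```python
from dataclasses import dataclass
from typing import Iterable

URGENT = "Urgent Review"
ROUTINE = "Routine Follow-up"
NO_FLAG = "No Flag"  # hypothetical label for reports with nothing to act on

@dataclass(frozen=True)
class FlaggedFinding:
    description: str
    confidence: int               # 0-100, as produced by AnomalyDetector
    justification: str
    time_sensitive: bool = False  # assumed marker for the nature of the anomaly

def action_flag(
    findings: Iterable[FlaggedFinding],
    urgent_at: int = 80,
    routine_at: int = 40,
) -> str:
    """Map flagged findings to a preliminary action label."""
    findings = list(findings)
    for finding in findings:
        if not 0 <= finding.confidence <= 100:
            raise ValueError(f"confidence out of range: {finding.confidence}")

    def is_urgent(f: FlaggedFinding) -> bool:
        # A time-sensitive anomaly escalates at the lower threshold.
        return f.confidence >= urgent_at or (
            f.time_sensitive and f.confidence >= routine_at
        )

    if any(is_urgent(f) for f in findings):
        return URGENT
    if any(f.confidence >= routine_at for f in findings):
        return ROUTINE
    return NO_FLAG
```

Because the rule is plain code rather than another LLM call, the same inputs always produce the same flag, which makes it easy to review and test.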
We used LangSmith extensively for tracing and debugging. Honestly, without LangSmith, this project would have been a nightmare. The ability to see every LLM call, every tool invocation, and the exact prompt/response pairs was invaluable. When AnomalyDetector started hallucinating ‘pulmonary embolisms’ on every other chest X-ray, LangSmith showed us exactly which RAG retrieval was failing or which part of the prompt was ambiguous. It’s not cheap, but for production systems, it’s non-negotiable. I think the $150/month for their team plan is fair, considering the time it saves in debugging and the critical nature of the data.
When Agents Fail: Debugging and Compliance in Healthcare
The initial deployment was a mess. We had a few major issues. First, the PDF-to-text conversion was inconsistent. Scanned documents, especially older ones, often produced garbled text, leading ReportReader to output nonsense. Imagine a report where ‘no acute findings’ became ‘no acutefindings’ or ‘mild cardiomegaly’ was parsed as ‘mild card iomegaly.’ These subtle errors, easy to miss when skimming the extracted text, completely threw off the downstream agents. We had to add a pre-processing step that ran scanned documents through an OCR service (Google Cloud Vision API, specifically) before the text reached ReportReader. This added cost and complexity, but it was essential for data quality. We also found that certain medical abbreviations or shorthand, common in older reports, were consistently misinterpreted, requiring a custom dictionary and a pre-LLM normalization step.
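A pre-LLM normalization pass of the kind described here can be quite small. The sketch below repairs known OCR breakages and expands abbreviations from a dictionary; the two OCR fixes come from the examples above, while the abbreviation entries and the function name are assumptions standing in for the clinic-specific dictionary.

```python
import re
from typing import Mapping

# OCR breakages taken from the examples in the text.
OCR_FIXES = {
    "acutefindings": "acute findings",
    "card iomegaly": "cardiomegaly",
}

# Hypothetical entries; a real dictionary would be curated with clinicians.
ABBREVIATIONS = {
    "cxr": "chest x-ray",
    "hx": "history",
    "w/o": "without",
}

def normalize_report_text(
    text: str,
    ocr_fixes: Mapping[str, str] = OCR_FIXES,
    abbreviations: Mapping[str, str] = ABBREVIATIONS,
) -> str:
    """Collapse whitespace, repair known OCR errors, expand abbreviations."""
    text = re.sub(r"\s+", " ", text).strip()

    for broken, fixed in ocr_fixes.items():
        # A function replacement avoids any escape handling in `fixed`.
        text = re.sub(
            re.escape(broken), lambda _m, f=fixed: f, text, flags=re.IGNORECASE
        )

    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = (
        r"(?<![\w/])(" + "|".join(re.escape(k) for k in keys) + r")(?![\w/])"
    )
    return re.sub(
        pattern,
        lambda m: abbreviations[m.group(0).lower()],
        text,
        flags=re.IGNORECASE,
    )
```

The lookarounds in the pattern keep the expansion from firing inside longer words, so only standalone abbreviations are rewritten.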
Second, the AnomalyDetector agent, despite our careful prompting, sometimes got ‘creative’ with its justifications. It would invent plausible-sounding medical reasons for flagging something that was actually normal. For instance, it once flagged a perfectly healthy lung field as having ‘diffuse interstitial thickening’ because the prompt mentioned ‘interstitial’ in a different context, and the LLM latched onto it, fabricating a finding. This was a huge compliance risk and a waste of the human radiologist’s time. We tightened the RAG sources, making sure the knowledge base was strictly factual and peer-reviewed, pulling from sources like UpToDate or specific clinical guidelines, not just general medical texts. We also added a ‘critique’ step within the LangGraph flow: a small, separate LLM call that reviewed AnomalyDetector’s output for logical consistency and adherence to medical guidelines before passing it to SummaryGenerator. This ‘self-correction’ mechanism added latency but drastically reduced false positives. It wasn’t perfect, but it caught the most egregious errors.
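The critique step is essentially a gate between AnomalyDetector and SummaryGenerator. In the sketch below the critic is pluggable, and the default is a deterministic stand-in (the cited phrase must appear verbatim in the report) rather than the LLM call we actually used; the `cited_text` key and the function names are assumptions for the example.

```python
from typing import Callable, Dict, List, Tuple

Finding = Dict[str, object]
Critic = Callable[[str, Finding], bool]

def grounded_in_report(report_text: str, finding: Finding) -> bool:
    """Deterministic stand-in for an LLM critic.

    Accepts a finding only if the phrase it cites appears verbatim
    (case-insensitively) in the original report text.
    """
    cited = str(finding.get("cited_text", "")).strip().lower()
    return bool(cited) and cited in report_text.lower()

def critique(
    report_text: str,
    findings: List[Finding],
    critic: Critic = grounded_in_report,
) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into (accepted, rejected) according to the critic."""
    accepted: List[Finding] = []
    rejected: List[Finding] = []
    for finding in findings:
        if critic(report_text, finding):
            accepted.append(finding)
        else:
            rejected.append(finding)
    return accepted, rejected
```

A check like this would have caught the fabricated ‘diffuse interstitial thickening’ case, since that phrase never appeared in the report. Rejected findings are returned rather than discarded so they can still be logged.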
Third, managing access and audit trails was a constant headache. Since we were dealing with Protected Health Information (PHI), every interaction, every decision, every data point touched by an agent needed to be logged and auditable. This wasn’t just about ‘saving logs’; it was about proving chain of custody and ensuring data integrity. We built a custom logging service that captured agent states, tool calls, and LLM inputs/outputs, pushing everything to an immutable ledger. This wasn’t something LangGraph or CrewAI provided out of the box; it was a significant engineering effort. We had to consider data retention policies, encryption at rest and in transit, and granular access controls. We also had to integrate with the clinic’s existing identity management system for tool access, which meant custom connectors and strict API key management. The governance overhead for healthcare agents is immense, and anyone telling you otherwise hasn’t shipped one. We spent weeks just getting the data flow compliant with HIPAA, which meant careful anonymization of any data used for training and strict access controls for production data.
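One common way to get tamper-evident logging is a hash chain, where each entry stores the hash of the one before it. The sketch below shows that idea in memory only; it is an illustration, not the service we built, and the class name, field names, and caller-supplied `timestamp` are all assumptions. A real deployment would also need durable storage, encryption, and access control.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(body: Dict[str, Any]) -> str:
    # Canonical JSON so the same content always hashes identically.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class AuditLedger:
    """Append-only, hash-chained log of agent events (in-memory sketch)."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def append(
        self, agent: str, event: str, payload: Dict[str, Any], timestamp: str
    ) -> Dict[str, Any]:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        body = {
            "agent": agent,
            "event": event,
            # JSON round-trip snapshots the payload so later mutation by
            # the caller cannot silently change the logged record.
            "payload": json.loads(json.dumps(payload)),
            "timestamp": timestamp,
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and confirm the chain is unbroken."""
        prev_hash = GENESIS_HASH
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True
```

Editing any earlier entry changes its hash and breaks every link after it, so `verify()` fails. Payloads would ideally hold references to PHI rather than the PHI itself.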