RAG across scales: A multi-backbone comparison of guideline-grounded LLM-agent sequential decision-making for ED acute abdominal pain
Author Block: R. Andre, H. E. Huisman; Nijmegen/NL
Purpose: To assess how guideline-grounded retrieval-augmented generation (RAG) affects the diagnostic performance and the sequential imaging and laboratory request behavior of LLM agents across backbone sizes (1B-70B) and domain-specific training (generalist vs. biomedical) for emergency-department (ED) acute abdominal pain pathologies.
Methods or Background: Using the MIMIC-IV-Ext Clinical-Decision-Making dataset (2,400 ED pathways covering appendicitis, cholecystitis, diverticulitis, and pancreatitis, all sharing acute abdominal pain as the initial symptom), we compared seven instruction-tuned backbones spanning generalist and biomedical-fine-tuned models: Llama-3.2-1B, Mistral-7B-v0.3, Gemma-2-9B, Llama-3.1-8B-UltraMedical, Qwen3-30B, Llama-3.1-70B, and Llama-3.1-70B-UltraMedical. LLM agents iteratively requested a physical examination, laboratory tests, or imaging (specifying modality and region) and received the corresponding reports. Once an agent judged the gathered evidence sufficient, it autonomously finalized, issuing a diagnosis and care plan without assistance. With RAG, guideline snippets were retrieved from a maintainable, disease-scoped knowledge base at each thinking step and appended to the working context before each action, grounding the iterative process in citable, expert-authored sources.
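The request-and-finalize loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the knowledge-base contents are placeholders, the keyword-overlap retriever stands in for whatever retrieval method was actually used, and `scripted_policy` is a hypothetical stub replacing the LLM backbone. All names (`KNOWLEDGE_BASE`, `retrieve`, `run_agent`, `CASE`) are invented for illustration.

```python
# Hypothetical toy knowledge base: disease -> placeholder snippets (not real guidance).
KNOWLEDGE_BASE = {
    "appendicitis": ["[placeholder] appendicitis guideline text on imaging for acute abdominal pain"],
    "cholecystitis": ["[placeholder] cholecystitis guideline text on laboratory tests"],
}

# Tools the agent may request; anything else counts as a non-existent tool.
VALID_TOOLS = {"physical_examination", "laboratory_tests", "imaging"}


def retrieve(query, k=2):
    """Return up to k snippets ranked by naive token overlap (assumed retriever)."""
    query_tokens = set(query.lower().split())
    scored = []
    for snippets in KNOWLEDGE_BASE.values():
        for snippet in snippets:
            overlap = len(query_tokens & set(snippet.lower().split()))
            if overlap > 0:
                scored.append((overlap, snippet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:k]]


def scripted_policy(working_context):
    """Hypothetical stand-in for the LLM: picks the next action from the report count."""
    n_reports = sum(line.startswith("REPORT") for line in working_context)
    if n_reports == 0:
        return {"type": "physical_examination"}
    if n_reports == 1:
        return {"type": "laboratory_tests"}
    if n_reports == 2:
        return {"type": "imaging", "modality": "ultrasound", "region": "abdomen"}
    return {"type": "finalize", "diagnosis": "appendicitis"}


def run_agent(case, policy, use_rag=True, max_steps=10):
    """Iterate request -> report until the policy finalizes or max_steps is reached."""
    context = [case["presentation"]]
    requests, invalid_requests, retrieved_per_step = [], [], []
    for _ in range(max_steps):
        working_context = list(context)
        if use_rag:
            # Retrieve at every step and append to the working context before acting.
            snippets = retrieve(" ".join(context))
            retrieved_per_step.append(snippets)
            working_context.extend("GUIDELINE: " + s for s in snippets)
        action = policy(working_context)
        if action["type"] == "finalize":
            return {"diagnosis": action["diagnosis"], "requests": requests,
                    "invalid_requests": invalid_requests, "retrieved": retrieved_per_step}
        if action["type"] not in VALID_TOOLS:
            # Track requests for non-existent tools separately.
            invalid_requests.append(action["type"])
            context.append("ERROR: unknown tool " + action["type"])
            continue
        requests.append(action["type"])
        report = case["reports"].get(action["type"], "not available")
        context.append("REPORT[" + action["type"] + "]: " + report)
    return {"diagnosis": None, "requests": requests,
            "invalid_requests": invalid_requests, "retrieved": retrieved_per_step}


# Hypothetical case with placeholder reports.
CASE = {
    "presentation": "acute abdominal pain with nausea",
    "reports": {
        "physical_examination": "[placeholder] examination report",
        "laboratory_tests": "[placeholder] laboratory report",
        "imaging": "[placeholder] imaging report",
    },
}

result = run_agent(CASE, scripted_policy, use_rag=True)
```

Keeping the policy as a plain callable means the same loop could, in principle, be run with and without retrieval (`use_rag=False`) for any backbone, which mirrors the paired comparison the abstract describes.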
Results or Findings: RAG improved average diagnostic accuracy for every backbone. Relative gains were most notable for smaller models (1-9B: 46.5% to 55.1%), while larger models (30-70B) also improved (67.4% to 72.8%). RAG reduced requests for non-existent tools (i.e., hallucinations), increased the alignment of imaging orders with clinician trajectories and guideline indications, and maintained disciplined laboratory selection. RAG-equipped agents gathered more evidence before finalizing and specified imaging parameters (modality and region) more consistently. Overall, RAG enhanced transparency by surfacing citable guidance throughout the decision chain.
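To make the "relative gains" comparison concrete, the absolute and relative improvements implied by the reported group averages can be computed as below. This is only a worked check on the four percentages stated above; the helper name `gains` is an assumption introduced for illustration.

```python
def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent), rounded to 0.1."""
    absolute = after - before
    relative = 100 * absolute / before
    return round(absolute, 1), round(relative, 1)


small_models = gains(46.5, 55.1)  # 1-9B group: (8.6, 18.5)
large_models = gains(67.4, 72.8)  # 30-70B group: (5.4, 8.0)
```

On these figures the smaller models gain about 8.6 percentage points (roughly 18.5% relative), against about 5.4 points (roughly 8.0% relative) for the larger models.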
Conclusion: Across seven backbones from 1B to 70B, including both generalist and biomedical-tuned models, guideline-grounded RAG consistently improves diagnostic accuracy and imaging decision behavior, supporting safer, more auditable LLM assistance for ED acute abdominal pain.
Limitations: This work is limited to the ED setting and covers only four pathologies from a single-centre, English-language dataset. Only open-weight models were explored, to respect MIMIC-IV's data use agreement. Prospective, multi-institutional validation and broader symptom coverage are needed.
Funding for this study: This study is part of the HealthyAI project (number KICH3.LTP.20.006) of the KIC research programme, which is (partly) financed by the Dutch Research Council (NWO), with co-funding from Siemens Healthineers.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: