Research Presentation Session: Artificial Intelligence and Imaging Informatics

RPS 1205 - Training tomorrow's radiologists and improving our clinical practice: the LLM-powered revolution

March 6, 08:00 - 09:00 CET

6 min
Structured data extraction from non-English brain MRI reports improves with few-shot prompting of an open-weight large language model
Kaouther Mouheb, Rotterdam / Netherlands
Author Block: K. Mouheb1, A. Pomp1, A. Manenti2, H. Seelaar1, F. Mattace-Raso1, M. W. Vernooij1, F. Wolters1, S. Klein1, E. Bron1; 1Rotterdam/NL, 2Toulouse/FR
Purpose: Automatic data extraction from free-text radiology reports enables large-scale research, but current open-weight large language models (LLMs) may underperform because they lack domain knowledge. We evaluate whether few-shot prompting with annotated examples improves extraction from non-English neuroradiology reports.
Methods or Background: We analyzed 947 Dutch free-text brain MRI reports (2016–2021) from a memory clinic. Trained medical students annotated 24 variables. We used the open-weight LLM LLaMA 3.1 to extract these variables. In zero-shot prompting, the model received only task instructions. In few-shot prompting, we provided three annotated examples selected using one of three strategies: random, fixed, or structural similarity-based selection. Performance was evaluated using overall accuracy, balanced accuracy for categorical fields, accuracy for numerical fields, and text similarity for free-text fields, averaged across 10 random train-test splits.
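The similarity-based example selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy reports, and the use of `difflib`'s character-level ratio as a stand-in for the (unspecified) structural similarity metric are all assumptions.

```python
from difflib import SequenceMatcher

def select_few_shot_examples(target_report, annotated_pool, k=3):
    """Return the k annotated reports most similar to the target report.

    SequenceMatcher's character-level ratio stands in for the abstract's
    (unspecified) structural similarity metric.
    """
    return sorted(
        annotated_pool,
        key=lambda ex: SequenceMatcher(None, target_report, ex["report"]).ratio(),
        reverse=True,
    )[:k]

# Hypothetical annotated pool (illustrative, not real study data)
pool = [
    {"report": "MRI brain: Fazekas 2, no microbleeds.", "labels": {"fazekas": 2}},
    {"report": "MRI knee: medial meniscal tear.", "labels": {}},
    {"report": "MRI brain: Fazekas 1, two microbleeds.", "labels": {"fazekas": 1}},
]
examples = select_few_shot_examples("MRI brain: Fazekas 3, one microbleed.", pool, k=2)
```

The selected examples (with their annotations) would then be prepended to the extraction prompt for each new report.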
Results or Findings: The overall accuracy of LLaMA 3.1 obtained with few-shot prompting with similarity-based selection (89% [CI: 87–90%]) was significantly higher than that obtained with zero-shot prompting (81% [81–82%]). Similar improvements were seen with the other selection strategies (random: 84% [82–86%]; fixed: 86% [83–90%]). Notably, few-shot prompting with similarity-based selection showed significantly higher accuracies in extracting microbleed counts (92% [90–93%]) and infarct counts (81% [77–85%]) compared to zero-shot prompting (microbleeds: 80% [78–82%], infarcts: 66% [63–68%]). Visual rating scores were extracted accurately in both settings, with slight gains from similarity-based few-shot prompting (balanced accuracy: 95% [92–97%] for Fazekas score, 92% [79–100%] for MTA, and 89% [85–93%] for GCA). Performance remained lower on location-specific variables (e.g., 69% for occipital GCA with similarity-based selection).
Conclusion: LLaMA 3.1 accurately extracted 24 clinical variables from neuroradiology reports. Few-shot prompting significantly improved performance, especially with similarity-based selection.
Limitations: The study involves a single LLM, one imaging modality, and reports from a single centre, written in Dutch.
Funding for this study: This study was co-funded by an Erasmus MC fellowship 2022. The study is part of TAP-dementia (www.tap-dementia.nl), receiving funding from ZonMw (#10510032120003) in the context of Onderzoeksprogramma Dementie, part of the Dutch National Dementia Strategy. This work was co-funded by Scan2go, a TKI-LSH-funded public-private partnership (LSHM22046-H036). This work was co-funded by the European Union under Grant Agreement number 1011100633. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them. This work used the Dutch national supercomputer Snellius with the support of the SURF Cooperative and a small compute grant from NWO using grant number EINF-11268.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the institutional Medical Ethics Committee with a waiver of informed consent (METC-2023-0569).
6 min
Automated image quality assessment in shoulder MR arthrography using large language model-generated Python workflows
Apostolos Petkoglou, Emmen / Netherlands
Author Block: A. Petkoglou1, A. Vegter1, T. Kwee2, H. Stallmann1; 1Stadskanaal/NL, 2Groningen/NL
Purpose: To compare automated image sharpness measurements with manual signal-to-noise and contrast-to-noise ratio (SNR, CNR) assessments and radiologist Likert scores in shoulder MR arthrography using saline versus gadolinium contrast.
Methods or Background: Forty patients aged 13–20 years underwent shoulder MRA between 2019 and 2024 on a 1.5T scanner: 20 received gadolinium (T1-weighted sequences) and 20 received saline (T2-weighted sequences). Claude and Google Gemini generated Python code in Colab to convert DICOM to JPEG and perform automated sharpness measurements using the variance of the Laplacian in 20 randomly placed 100×100-pixel squares centered around the joint. Manual measurements included normalized SNR/CNR from circular ROIs (1–2 mm labrum, 2–5 mm contrast, 10–12 mm background air). Two musculoskeletal radiologists independently rated image quality using 5-point Likert scales, blinded to contrast type. Statistical analysis included Mann-Whitney U tests, Spearman correlations with FDR correction, and bootstrapped non-inferiority testing (δ=0.15).
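The variance-of-the-Laplacian measure over random crops can be sketched in plain NumPy. This is a hedged approximation of the LLM-generated workflow, not the study's actual code (which may have used, e.g., OpenCV); the kernel choice, function names, and crop logic here are assumptions.

```python
import numpy as np

# Standard 3x3 Laplacian kernel
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def laplacian_variance(img):
    """Variance of the Laplacian: a common no-reference sharpness score."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):                      # 2-D convolution, valid region only
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out.var()

def crop_sharpness(img, n_crops=20, size=100, seed=0):
    """Mean Laplacian variance over n randomly placed size x size crops."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    scores = []
    for _ in range(n_crops):
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        scores.append(laplacian_variance(img[y:y + size, x:x + size]))
    return float(np.mean(scores))
```

Sharper images have stronger local intensity transitions, so the Laplacian response varies more; blurring suppresses this variance.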
Results or Findings: Automated sharpness showed plane-dependent differences: gadolinium sharper in axial view (whole image p=0.0006, random crops p=0.0056), saline sharper in coronal view (random crops p=0.0411). Pooled analysis showed no significant sharpness differences (p=0.1371, p=0.7765). Inter-rater agreement ranged from ICC=0.27-0.41 (single rater) to ICC=0.52-0.66 (average raters). Only supraspinatus tendon showed preference for saline by one radiologist (p=0.024). Diagnostic certainty correlated positively with normalized CNR across multiple comparisons (FDR-corrected p<0.05).
There were no significant differences in normalized SNR (p=0.870) or CNR (p=0.271) between contrast agents. Non-inferiority of saline was confirmed via bootstrapping (95% CI within ±0.15 margin).
Conclusion: AI-generated automated workflows successfully quantified image quality metrics comparable to manual assessment. This approach enables hypothesis testing, upscaling to large datasets, and implementation in quality control and equipment procurement decisions with minimal programming expertise required.
Limitations: Single-center study with adolescent population may limit generalizability. Subjective assessments prone to inter-reader variability.
Funding for this study: We have applied for a grant from the Stichting De Cock -Hadders (research foundation).
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethics approval obtained from Scientific Committee of Treant Zorggroep for this retrospective study.
6 min
Agentic AI framework for structured reporting (CAD-RADS) and grounding from unstructured CCTA reports
Sinan Batman / United States
Author Block: A. G. D'Sa1, N. Saini1, A. Vazquez2, S. Batman3, G. Urrutia3; 1Bengaluru/IN, 2Frankfurt/DE, 3Durham, NC/US
Purpose: Clinical radiology priors provide detailed patient history and imaging findings but remain unstructured (CCTA reports) and time-consuming to review. Structured reporting (CAD-RADS) standardizes and condenses reports for clarity and consistency, yet converting unstructured reports is laborious. Advances in Natural Language Processing (NLP) and Large Language Models (LLMs) enable transforming unstructured text into structured reports. However, LLMs can produce unreliable outputs (hallucinations), so detecting errors and linking structured content to the original report is crucial for accuracy and traceability.
Methods or Background: The procedure has two phases: (i) structured reporting using multiple agents, and (ii) grounding and hallucination detection (Figure 1). The document is processed by a "Study Understanding Agent" to identify the study type (e.g., plaque analysis). The "Fields Generating Agent" is invoked three times to compile a comprehensive list of fields. Next, the "Value Extraction Agent" extracts corresponding field values from the report. Empty fields are filtered out, followed by a "Validation Agent" that detects incorrectly filled or unnecessary fields. In phase two, a "Grounding Agent" traces values back to their sources, adding redundancy for validation. This modular approach ensures comprehensive, accurate, and validated structured reporting.
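The phase-one agent flow can be sketched with stub functions; in the real framework each agent is presumably an LLM call, so every function body, field name, and example value below is a hypothetical stand-in, not the authors' implementation.

```python
def study_understanding_agent(report):
    # Stand-in classifier; the real agent is an LLM prompt.
    return "plaque analysis" if "plaque" in report.lower() else "standard CCTA"

def fields_generating_agent(study_type):
    # Each invocation proposes candidate fields; three passes are unioned.
    fields = {"CAD-RADS category", "stenosis severity"}
    if study_type == "plaque analysis":
        fields.add("plaque burden")
    return fields

def value_extraction_agent(report, fields):
    # Stand-in extraction; the real agent prompts an LLM per field.
    return {
        f: ("50% stenosis" if f == "stenosis severity" and "50%" in report else None)
        for f in fields
    }

def run_pipeline(report):
    study_type = study_understanding_agent(report)
    fields = set()
    for _ in range(3):                        # Fields Generating Agent, three passes
        fields |= fields_generating_agent(study_type)
    values = value_extraction_agent(report, fields)
    filled = {f: v for f, v in values.items() if v is not None}  # drop empty fields
    return {"study_type": study_type, "fields": filled}

result = run_pipeline("CCTA with plaque analysis: 50% stenosis in the LAD.")
```

Phase two (grounding) would then map each filled value back to its source span in the report for validation.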
Results or Findings: Evaluated on 50 coronary CT angiography (CCTA) reports, our structured reporting framework extracted over 85% of required fields with 96% accuracy for filled values. The grounding agent achieved 88% accuracy.
Conclusion: Structured reporting accelerates clinical studies by highlighting essential fields clearly and efficiently. LLMs enable automatic generation of structured reports with high precision. Additionally, LLMs can be leveraged for traceability and validation of generated content, ensuring reliable and accurate reporting.
Limitations: Currently, evaluation is limited to CCTA reports. In future work, the framework can be extended to other report types (LI-RADS, PI-RADS) and grounding accuracy enhanced using traditional NLP techniques alongside LLMs.
Funding for this study: ConcertAI
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Initial Insights into an Institutional Secure Large Language Model for MRI Examination Requests
Yi Xian Low, Singapore / Singapore
Author Block: J. T. P. D. Hallinan, N. W. Leow, Y. X. Low, A. Lee, W. Ong, D. Z. M. Chan, D. D-L. Loh, A. Makmur, Y. Ting; Singapore/SG
Purpose: To compare clinician MRI examination requests (MERs) with institutional secure large language model (sLLM)-augmented MERs for information quality and to evaluate protocoling accuracy of the sLLM versus board-certified radiologists across body, musculoskeletal, and neuroradiology MRI.
Methods or Background: Incomplete clinical details on MERs can lead to sub-optimal protocol selection. An institutional sLLM with access to the electronic medical record (EMR) may improve request completeness and protocol accuracy across multiple MRI subspecialties.

This retrospective study included 608 consecutive MRI examinations in 528 patients performed between September 2023 and July 2024. A privately hosted Anthropic Claude 3.5 model augmented each MER with EMR data and, via rule-based parsing, recommended region/coverage and contrast use. Two experienced radiologists established a consensus reference standard, against which two board-certified general radiologists and the sLLM were compared. Clinical-information quality was graded using the Reason-for-Exam Imaging Reporting and Data System (RI-RADS). Inter-rater reliability was quantified with Gwet's AC1, and paired accuracies were compared with McNemar testing.
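McNemar testing on paired accuracies, as used here, reduces to the two discordant-pair counts. A minimal sketch (the function name and the omission of a continuity correction are choices made for illustration, not necessarily the study's exact procedure):

```python
from math import erf, sqrt

def mcnemar(b, c):
    """McNemar chi-square (1 df, no continuity correction) for paired
    accuracies: b and c are the two discordant-pair counts, i.e. cases
    one rater classified correctly and the other did not."""
    if b + c == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    # P(chi2 with 1 df > stat) = 2 * (1 - Phi(sqrt(stat)))
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(stat) / sqrt(2.0))))
    return stat, p

stat, p = mcnemar(15, 5)   # hypothetical discordant counts, not study data
```

Only the discordant pairs carry information: cases where both the sLLM and the radiologist agree with the reference (or both disagree) cancel out.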
Results or Findings: Limited or deficient clinical information fell to 0–0.7% with sLLM augmentation versus 5.2–20.4% for clinician MERs. Overall protocol accuracy was 93.1% for the sLLM, 91.4% for Rad 3, and 92.1% for Rad 4. Region/coverage accuracy was similar (sLLM 95.2%, Rad 3 96.2%, Rad 4 94.2%). Contrast decisions were more accurate using the sLLM at 94.4% versus Rad 3 at 92.1%.
Conclusion: Across subspecialty MRI, sLLM-augmented examination requests had improved clinical context and contrast selection while matching general radiologists for region/coverage. Integrating sLLMs into vetting workflows may reduce manual workload and standardize protocoling.
Limitations: This was a single-centre, retrospective evaluation. Whether similar gains would be realised in institutions that use different order-entry systems is a future area of research.
Funding for this study: Singapore Ministry of Health National Medical Research Council under the NMRC Clinician Innovator Award (CIAINV23jan-0001, MOH-001405).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Large Language Model-Assisted CT Protocol Selection: A Comparative Study of Finetuning vs Zero-Shot Reasoning
Philipp Arnold, Freiburg Im Breisgau / Germany
Author Block: P. Arnold, M. Russe, E. Kotter, M. Scholz, M. Jeuck, L. Heine, T. Stein, S. Walter; Freiburg Im Breisgau/DE
Purpose: To evaluate whether large language models (LLMs) can assist radiologists in selecting the optimal CT protocol (protocol name, body region, and appropriate use of intravenous contrast), we compared two strategies: supervised fine-tuning on institutional data versus zero-shot reasoning with structured prompting.
Methods or Background: CT protocol selection balances diagnostic yield against radiation and contrast risks. We retrospectively analyzed 20,000 CT exams with clinical indications, history, and standardized protocol labels covering 100 distinct protocol names, body region, and contrast use.

Fine-tuning: Qwen-2.5 models (3B–32B) were trained to map clinical text to the correct protocol name, body region, and contrast use.
Zero-shot reasoning: Qwen-3 models (4B–32B) received structured prompts with the same inputs, with or without prior imaging reports.
Benchmarking: Three resident radiologists independently selected protocols for the same test set. Accuracy was assessed for (1) protocol name, (2) body region, and (3) contrast use.
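The Top-3 metric used in the benchmarking above can be made concrete with a short sketch; the protocol names and the function name below are illustrative assumptions, not the study's label set.

```python
def topk_accuracy(ranked_predictions, truths, k=3):
    """Fraction of cases where the true label appears among the
    model's top-k ranked suggestions."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)

# Hypothetical ranked protocol suggestions and ground truth (illustrative only)
preds = [
    ["CT head native", "CT head with contrast", "CT angiography head"],
    ["CT chest with contrast", "CT abdomen with contrast", "CT chest native"],
]
truths = ["CT head with contrast", "CT chest native"]
```

With 100 candidate protocol names, Top-3 accuracy is a natural decision-support metric: the model narrows the choice to a short list the radiologist confirms.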
Results or Findings: Fine-tuned LLMs achieved 68–77% accuracy for protocol names across 100 labels (Top-3: 84–87%) and 90–96% accuracy for body region and contrast, with the 7B model performing best.
Zero-shot models reached 91–93% accuracy for body region and contrast, comparable to residents (87–90% / 89–94%), but lower for protocol names (53–56% vs. 58–62%). Prior imaging reports improved contrast accuracy by 1–4% across all models.
Conclusion: LLMs show potential as decision-support tools for CT protocol selection. Fine-tuned models surpassed radiologists in protocol-name matching, while zero-shot models matched human performance on individual components and leveraged prior reports to enhance contrast decisions. To improve protocol selection, inclusion of key clinical details in the provided data is essential.
Limitations: Single-institution data and protocol taxonomy may limit generalizability.
Funding for this study: German Research Foundation (DFG) - SFB 1597–499552394.
Hans A. Krebs Programme (University Clinic Freiburg im Breisgau)
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethics Vote register number: FRKS004287
6 min
The Expertise Paradox: Who Benefits from LLM-Assisted Brain MRI Differential Diagnosis?
Su-Hwan Kim, Munich / Germany
Author Block: S. Schramm1, B. Le Guellec2, L. C. Adams1, K. Bressem1, J. S. Kirschke1, D. M. Hedderich1, B. Wiestler1, S. H. Kim1; 1Munich/DE, 2Lille/FR
Purpose: To evaluate how reader experience influences the diagnostic benefit from large language model-assisted brain MRI differential diagnosis.
Methods or Background: Neuroradiologists (n = 4), radiology residents (n = 4), and neurology/neurosurgery residents (n = 4) were recruited. A dataset of complex brain MRI cases was curated from the local imaging database (n = 40). For each case, readers provided a textual description of the main imaging finding and their top three differential diagnoses (“Unassisted”). Three state-of-the-art large language models (GPT-4.1, Gemini 2.5 Pro, DeepSeek-R1) were prompted to generate top-three differentials based on the clinical case description and reader-specific findings. Readers then revised their differential diagnoses after reviewing GPT-4.1 suggestions (“Assisted”). To statistically evaluate the association between reader experience and diagnostic benefit, a cumulative link mixed model was fitted with change in diagnostic result as ordinal outcome, reader experience as a fixed effect, and random intercepts for rater and case.
Results or Findings: LLM-generated differential diagnoses achieved the highest top-3 accuracy when provided with image descriptions from neuroradiologists (top-3: 78.8–83.8%), followed by radiology residents (top-3: 71.8–77.6%) and neurology/neurosurgery residents (top-3: 62.6–64.5%). In contrast, relative gains in top-3 accuracy from LLM assistance diminished with increasing experience: +19.2% for neurology/neurosurgery residents (from 43.2% to 62.6%), +14.7% for radiology residents (from 59.6% to 74.4%), and +4.4% for neuroradiologists (from 83.1% to 87.5%). The cumulative link mixed model confirmed a significant negative association between reader experience and diagnostic benefit from LLM assistance (p = 0.005).
Conclusion: With increasing reader experience, absolute diagnostic LLM performance with reader-specific input improved, while relative diagnostic gains through LLM assistance paradoxically diminished. Our findings call attention to the gap between isolated LLM performance and actual clinical relevance, emphasizing the need to account for human-AI interaction.
Limitations: N/A
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethics Committee of the Technical University of Munich
6 min
European guideline informed RAG-based GPT-4 decision support tool in tumor board meetings for breast cancer treatment
Javid Abbasli, Baku, Azerbaijan / Azerbaijan
Author Block: N. Abdullayev1, V. Valiyev2, J. Abbasli2, S. Sanduleanu3, J. Kottlors1, S. Lennartz1, H. Habibov1, F. Yilmaz4; 1Troisdorf/DE, 2Baku, Azerbaijan/AZ, 3Brunssum/NL, 4Düsseldorf/DE
Purpose: To determine whether a retrieval-augmented (RAG) GPT-4 model (“MammaBoardGPT”) grounded in European breast cancer guidelines improves agreement with multidisciplinary tumor board (MTB) decisions versus baseline GPT-4.
Methods or Background: Single-centre, retrospective analysis of 25 breast cancer cases discussed at a German hospital MTB. For each case, baseline GPT-4 and a RAG-enhanced GPT-4—few-shot conditioned with five MTB-labelled exemplars and guideline passages—generated management recommendations from the same structured case summary. Agreement with final MTB decisions was categorized as complete / partial / none. We additionally assessed the effect of recursive prompting using the Stuart–Maxwell test.
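The retrieval step of a RAG setup like the one described can be illustrated with a toy bag-of-words retriever. This is a deliberately simplified stand-in: the real system presumably uses embedding-based retrieval over guideline passages, and all function names and example passages below are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(case_summary, passages, k=2):
    """Rank guideline passages by bag-of-words similarity to the case summary."""
    query = Counter(case_summary.lower().split())
    return sorted(
        passages,
        key=lambda p: cosine(query, Counter(p.lower().split())),
        reverse=True,
    )[:k]

# Hypothetical guideline snippets (illustrative, not actual guideline text)
passages = [
    "Bone health: consider bisphosphonates in postmenopausal patients.",
    "HER2-positive disease: consider trastuzumab-based therapy.",
]
top = retrieve("HER2-positive early breast cancer", passages, k=1)
```

The retrieved passages are then inserted into the GPT-4 prompt alongside the structured case summary and the few-shot MTB exemplars.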
Results or Findings: After recursive prompting, MammaBoardGPT reached 84% complete agreement and 16% partial agreement, with 0% disagreement versus MTB decisions. Standard GPT-4 achieved 76% complete agreement, 20% partial agreement, and 4% disagreement. Agreement improved significantly for MammaBoardGPT before vs after recursive prompting (P = 0.0048), but not for standard GPT-4 (P = 0.135). Post-recursive prompting, there was no significant difference between MammaBoardGPT and GPT-4 (P = 0.37).
Conclusion: A European guideline-grounded, RAG-enhanced GPT-4 shows high concordance with MTB decisions and benefits from recursive prompting; however, the retrospective single-centre design, small cohort, restricted inputs/corpus, and lack of prospective outcome and safety assessment temper generalisability. Prospective, multi-centre, real-time studies with robust governance are required before clinical deployment.
Limitations: This single-centre retrospective study with a small sample (N=25) limits statistical power and generalisability. Inputs were structured summaries only (no imaging/EMR) with retrieval confined to selected European guidelines; few-shot/prompt design may introduce bias and potential information leakage, and outcomes relied on concordance metrics (workflow, safety, and regulatory impacts were not assessed).
Funding for this study: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. (Funding acquisition: not applicable.)
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the Ethics Committee of the Ärztekammer Nordrhein (approval no. 235–2024).
6 min
Large Language Model-Assisted Simplification of CT Staging Reports for Cancer Patients: A Prospective Quasi-Randomized Trial
Felix Busch, Munich / Germany
Author Block: F. Busch, P. Prucker, K. K. Bressem, J. Peeken, A. W. Marka, S. H. Kim, S. Ziegelmayer, M. R. Makowski, L. C. Adams; Munich/DE
Purpose: To evaluate whether large language model (LLM)-assisted simplification of CT staging reports improves cancer patients' cognitive workload, text comprehension, report perception, and reading time.
Methods or Background: Prospective, controlled, open-label, quasi-randomized, pre-registered trial of 200 adult cancer patients undergoing routine CT re-staging with alternate 1:1 allocation to receive either the unmodified report or a locally generated LLM-simplified version (Llama-3.3-70B, on-premise via basebox) with mandatory radiologist review. Co-primary outcomes were reading time and three composite scores (cognitive workload, text comprehension, patient perception), each derived from three 7-point Likert items. Secondary outcomes included readability indices, word count, medical-terminology ratings, and independent radiologist assessments of factual errors, omissions, insertions, clinical usefulness, and overall quality. Logistic regression was performed to analyze patient-reported outcomes, adjusting for patient characteristics.
Results or Findings: Simplification reduced median reading time from 7 to 2 minutes (adjusted β: -3.86; 95% confidence interval (CI): -5.46, -2.26; P<.001). Patients reported lower cognitive workload (adjusted odds ratio (aOR): 0.18; 0.13, 0.25), higher text comprehension (aOR: 13.28; 9.31, 18.93), and enhanced perception of report usefulness (aOR: 5.46; 3.55, 8.38; all P<.001). Readability was significantly improved across metrics (e.g., Flesch-Kincaid Grade Level from 13.69 ± 1.13 to 8.89 ± 0.93; P<.001). Two radiologists independently identified factual errors in 6% of simplified reports (2 moderate, 4 severe), omissions in 7% (2 minor, 1 moderate, 4 severe), and unsupported insertions in 3% (1 minor, 2 moderate). The majority of simplified reports were rated clinically useful and of good or better quality.
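The Flesch-Kincaid Grade Level reported above follows a fixed formula; a minimal sketch (the syllable counter is a crude vowel-group heuristic of my own, whereas readability libraries use pronunciation data, and the example sentences are invented, not study reports):

```python
import re

def count_syllables(word):
    """Crude vowel-group heuristic; production tools use pronunciation data."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1                               # drop a typical silent final 'e'
    return max(n, 1)

def fk_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59"""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

original = ("Computed tomography demonstrates interval regression "
            "of previously identified pulmonary metastases.")
simplified = "The scan is fine. The spots in your lungs have become smaller."
```

Shorter sentences and fewer polysyllabic words drive the grade level down, which is exactly what LLM simplification targets.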
Conclusion: LLM simplification substantially improves patient-centered outcomes and readability of CT staging reports while maintaining generally favorable clinical usefulness and quality. However, clinically relevant errors underscore the need for expert radiologist oversight before clinical implementation.
Limitations: Single-center, open-label with alternate allocation, self-reported patient outcomes.
Funding for this study: None.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Technical University of Munich (2025-186-S-KK)
6 min
Local large language models for MRI protocol selection: A privacy-preserving alternative to cloud AI
Timotheus Josef Neumann, Bonn / Germany
Author Block: Z. Bendella, T. J. Neumann, Z. Ganji, R. Clauberg, N. Lehnen, A. Radbruch, M. Wolter, B. D. Wichtmann; Bonn/DE
Purpose: Cloud-based AI like ChatGPT has demonstrated high accuracy for MRI protocol selection. However, transmitting patient data to external servers entails privacy risks. This study evaluated whether LLaMA-3.1-8B, a compact open-source model deployable locally, can achieve comparable accuracy while maintaining complete data sovereignty.
Methods or Background: This IRB-approved, retrospective study used real-world radiology referral forms (RRFs) and corresponding MRI sequences from our institutional neuroradiology department. For model development, 8,281 consecutive MRI examinations (November 2023–January 2025) were included, with RRFs extracted from the RIS and executed sequences from the PACS, split into training (n = 6,624) and validation (n = 1,657). For testing, 1,001 consecutive RRFs (August 2023–July 2024) covering the full range of neuroradiological MRI protocols were included, with ground truth protocol selections defined by two board-certified neuroradiologists. LLaMA-3.1-8B was fine-tuned using prompt-tuning and LoRA (Low-Rank Adaptation), and compared against the pretrained model and ChatGPT-4. Accuracy was evaluated using BioBERT-based similarity scores between model outputs and expert ground truth.
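The LoRA mechanism used for fine-tuning can be sketched in NumPy: the pretrained weight stays frozen and only a low-rank update is trained. The dimensions, seed, and scaling below are illustrative assumptions, not the study's training configuration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA: the frozen weight W plus a trainable low-rank update B @ A,
    scaled by alpha / r, so only r * (d_in + d_out) parameters are trained."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2
W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # low-rank factor (trainable)
B = np.zeros((d_out, r))                     # initialised to zero, so the
x = rng.standard_normal((4, d_in))           # adapter starts as a no-op

# With B = 0, the adapted layer exactly reproduces the base model.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Initialising B at zero means fine-tuning starts from the pretrained behaviour and only gradually deviates, which is one reason LoRA is practical for compact locally deployed models.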
Results or Findings: ChatGPT-4 achieved 99.7% accuracy on the test set. The locally deployed pretrained LLaMA-3.1-8B achieved 93.2% accuracy, improving to 94.5% with prompt-tuning and 95.4% with LoRA fine-tuning.
Conclusion: Despite being orders of magnitude smaller than ChatGPT-4, the locally deployed and fine-tuned LLaMA-3.1-8B achieved near-comparable accuracy in MRI protocol selection. This demonstrates the feasibility of privacy-preserving, institutionally controlled AI solutions to support radiologists without external data transfer, combining clinical utility with data sovereignty. These results establish a performance benchmark for local LLM deployment in radiology protocol selection.
Limitations: Fine-tuning refinements may further improve accuracy, though radiologist oversight remains essential for diagnostic safety. As this work employed a comparatively modest 8-billion-parameter model, future investigations with larger locally deployable models are expected to yield even higher performance.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The retrospective study received approval from the local Ethics Committee for Clinical Trials on Humans and Epidemiological Research with Personal Data, IRB number: 312/23-EP.