Research Presentation Session: Artificial Intelligence and Imaging Informatics

RPS 2005 - How large language models are transforming radiological reporting

March 7, 14:00 - 15:30 CET

6 min
Multicenter Clinical Trial on the Application of a Self-Developed AI-Assisted Detection Software for Intracranial Aneurysms on MRA Images
Xin Cao, Shanghai / China
Author Block: Y. Bao, X. Cao, Q. Zhao, X. Zhao, B. Huang, Y. Luo, Z. Zheng, W. Liu, D. Geng; Shanghai/CN
Purpose: Our team has independently developed AIneurysm, a computer-aided detection software for TOF-MRA, which has obtained the Class III Medical Device Registration Certificate in China. By accurately segmenting the cerebral arteries, it assists physicians in detecting aneurysm lesions. The multicenter clinical trial was subsequently launched to rigorously validate its diagnostic performance.
Methods or Background: This study is a prospective, multicenter, fully crossed multi-reader multi-case trial of AIneurysm, conducted from December 2024 to November 2027. Five medical centers are prospectively gathering cranial MRA data, targeting enrollment of 1,050 cases. Each center assigned two junior radiologists as the participating readers, who conducted two rounds of full-slice reading for all images in their respective centers: an unassisted reading (control) and an AI-assisted reading (experimental). The presence or absence of aneurysms, the location of the lesions, and the maximum diameter of the aneurysms were recorded. The consensus reference standard was established by three senior radiologists. Dorfman-Berbaum-Metz-Hillis analysis was used to compare the areas under the alternative free-response receiver operating characteristic (AFROC) curves, sensitivity, and specificity between the two groups.
Results or Findings: A total of 484 cases with 258 lesions have been enrolled to date. The experimental (AI-assisted) group outperformed the control group, with an AFROC AUC gain of 0.0409 (P=0.001, 95% CI: 0.020–0.062), meeting the superiority criterion. Lesion-level sensitivity improved by 0.070 (P=0.010, 95% CI: 0.036–0.104), surpassing the predefined superiority threshold, while case-level specificity rose by 0.012 (F=0.457, P=0.010, 95% CI: 0.016–0.040), satisfying the non-inferiority criterion.
Conclusion: The interim results show that the AI-assisted detection software AIneurysm for intracranial aneurysms on MRA images demonstrated superior diagnostic performance compared to independent reading by junior physicians.
Limitations: Omitted.
Funding for this study: Science and Technology Commission of Shanghai Municipality (24SF1904200).
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethics Committees of the five medical centers:
Huashan Hospital Institutional Review Board, Ethics Committee of Longhua Hospital affiliated to Shanghai University of Traditional Chinese Medicine, Medical Ethics Committee of Shanghai Pudong New Area Gongli Hospital, Medical Ethics Committee of Shanghai Fifth People's Hospital Affiliated to Fudan University, and Medical Ethics Committee of Shanghai Fourth People's Hospital.
6 min
Report-Driven Segmentation: Zero-Shot LLM Extraction of SUVmax and Lesion Location from PET/CT for automatic tumor segmentation
Christian Bojahr, Essen / Germany
Author Block: S. Warmer, L. Umutlu, J. Haubold, Y. Wen, C. S. Schmidt, C. Bojahr, K. A. Borys, F. Nensa, R. Hosch; Essen/DE
Purpose: Standardized uptake value (SUV) metrics, particularly SUVmax, quantify PET/CT radiotracer uptake, which is relevant for tumor identification and characterization. However, this critical information is often buried in free-text radiological reports. We evaluate a large language model (LLM) for extracting SUVmax values with their corresponding anatomical sites and examine its use for initiating automated tumor segmentation.
Methods or Background: We selected PET/CT reports from 100 patients (female=38, 66±9.93 years, NSCLC=97, SCLC=3) diagnosed with lung cancer between 2006 and 2020, each performed within ±30 days of initial diagnosis. The reports were analyzed using a 70B LLaMA 3.3 instruct model with zero-shot prompting to extract SUVmax values together with their corresponding body regions. Radiological experts evaluated the extracted data using a questionnaire to assess whether the SUVmax values and locations were correctly extracted. In a subsequent use case, the extracted SUVmax values and locations were used to define a seed coordinate for initializing automatic tumor segmentation with Body and Organ Analysis (BOA) and nnInteractive. A radiologist then evaluated the resulting segmentation masks case by case.
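The extraction step described above can be sketched as follows; the prompt wording, the JSON answer format, and the `call_llm` stub are illustrative assumptions, not the study's actual pipeline:

```python
import json

# Hypothetical zero-shot prompt; the exact wording used with LLaMA 3.3 is
# not given in the abstract.
PROMPT = (
    "Extract every SUVmax value and its anatomical location from the "
    "PET/CT report below. Answer ONLY with a JSON list of objects with "
    "keys 'suv_max' (number) and 'location' (string).\n\nReport:\n{report}"
)

def call_llm(prompt: str) -> str:
    """Stub standing in for the 70B LLaMA 3.3 instruct endpoint."""
    return '[{"suv_max": 14.2, "location": "right upper lobe"}]'

def extract_suvmax(report: str) -> list[dict]:
    raw = call_llm(PROMPT.format(report=report))
    findings = json.loads(raw)
    # Keep only well-formed entries before handing coordinates downstream.
    return [f for f in findings
            if isinstance(f.get("suv_max"), (int, float)) and f.get("location")]

findings = extract_suvmax("Hypermetabolic mass right upper lobe, SUVmax 14.2.")
```

The validated `(suv_max, location)` pairs would then seed the segmentation step.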
Results or Findings: The LLM accurately extracted SUVmax values and their corresponding locations in 97% of cases, demonstrating high consistency across report styles and time periods. The automatically generated segmentation masks based on the extracted coordinates and values were clinically usable without modification in 70% of the cases.
Conclusion: This study demonstrates that LLMs accurately extract SUVmax and anatomical context from unstructured PET/CT reports. The structured output enabled automated tumor segmentation, underscoring the potential of LLMs as integral components in clinical segmentation pipelines.
Limitations: The limitations of the study are the small sample size, single tumor type focus, an imbalance between NSCLC and SCLC cases, and the use of one LLM with limited prompt strategies.
Funding for this study: No funding was provided for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study was performed in adherence to all guidelines defined by the approving institutional review board of the investigating hospital. The Institutional Review Board waived written informed consent due to the retrospective nature of the study. Complete anonymization of all data was performed before inclusion in the study.
6 min
Large language models-based simplification of Breast Imaging Reports: A prospective multicentric study
Matilde Pavan, Milan / Italy
Author Block: V. Magni1, M. Pavan1, A. Cozzi2, A. Liguori1, F. Pesapane1, S. Carriero1, G. Carrafiello1; 1Milan/IT, 2Lugano/CH
Purpose: To evaluate patients’ perception of breast imaging reports simplified by ChatGPT-4 compared to radiologist-written reports, focusing on simplicity, comprehensibility, and empathy, and to assess the role of educational level in shaping preferences.
Methods or Background: This prospective multicenter study included 10 anonymized mammography and ultrasound reports (BI-RADS 1–5), each simplified by ChatGPT-4 with a 50-word limit. Report pairs (original vs. AI-simplified) were assessed by 300 patients (2965 responses) and 20 physicians using a Likert-scale questionnaire on simplicity, comprehensibility, and empathy. Preferences and demographic data were collected, and logistic regression analyzed factors influencing choices.
Results or Findings: AI-generated reports were preferred in 63.3% of responses. They scored higher for simplicity (69.7% levels 4–5), comprehensibility (67.8% levels 4–5), and empathy (predominantly levels 3–4). Higher scores in all three domains significantly increased the likelihood of AI preference. Participants with advanced education (Bachelor’s/Master’s degrees) showed a stronger inclination toward AI-simplified reports. Physicians confirmed the clinical accuracy and safety of AI outputs.
Conclusion: ChatGPT-4 can generate simplified breast imaging reports that patients perceive as clearer, more comprehensible, and more empathetic than traditional versions. This approach may enhance patient understanding and engagement, while maintaining accuracy. Broader validation in different languages, clinical contexts, and AI platforms is warranted.
Limitations: Only 10 report pairs were tested, all in Italian, and only ChatGPT-4 was evaluated. More complex BI-RADS categories remain challenging for AI simplification. These factors may limit generalizability.
Funding for this study: This study received no external funding. Institutional resources from participating centers supported the project.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
RAG across scales: A multi-backbone comparison of guideline-grounded LLM-agent sequential decision-making for ED acute abdominal pain
Romain Andre, Nijmegen / Netherlands
Author Block: R. Andre, H. E. Huisman; Nijmegen/NL
Purpose: Assess how guideline-grounded retrieval-augmented generation (RAG) improves diagnostic performance and sequential imaging/laboratory request-behavior of LLM-agents across backbone sizes (1B-70B) and domain-specific trainings (general/biomedicine) for emergency-department acute abdominal pain pathologies.
Methods or Background: Using the MIMIC-IV-Ext Clinical-Decision-Making dataset (2,400 ED pathways: appendicitis, cholecystitis, diverticulitis, pancreatitis, sharing acute abdominal pain as the initial symptom), we compared seven instruction-tuned backbones: Llama-3.2-1B, Mistral-7B-v0.3, Gemma-2-9B, Llama-3.1-8B-UltraMedical, Qwen3-30B, Llama-3.1-70B, Llama-3.1-70B-UltraMedical, spanning both generalist and biomedical fine-tuned models. The LLM-agents iteratively requested physical examination, laboratory tests, or imaging (modality and region), received the corresponding reports, and then autonomously finalized once they judged that sufficient evidence had been retrieved, issuing a diagnosis and care plan without assistance. With RAG, guideline snippets were retrieved from a maintainable, disease-scoped knowledge base at each reasoning step and appended to the working context before each action, grounding the iterative process in citable, expert-authored sources.
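A minimal sketch of the retrieval step, assuming a simple keyword-overlap ranking over a toy knowledge base (the study's actual retriever and guideline snippets are not described in this abstract):

```python
import re

KNOWLEDGE_BASE = {  # toy guideline snippets -- illustrative, not the study's KB
    "appendicitis": "Ultrasound first in suspected appendicitis; CT if equivocal.",
    "cholecystitis": "Abdominal ultrasound is first-line for suspected cholecystitis.",
    "pancreatitis": "Lipase above threshold supports pancreatitis; CT for complications.",
    "diverticulitis": "Contrast-enhanced CT is preferred for suspected diverticulitis.",
}

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(context: str, k: int = 2) -> list[str]:
    """Rank snippets by token overlap with the agent's working context."""
    return sorted(KNOWLEDGE_BASE.values(),
                  key=lambda s: len(tokenize(s) & tokenize(context)),
                  reverse=True)[:k]

context = "RLQ pain and fever; considering ultrasound versus CT for appendicitis"
grounding = retrieve(context)  # appended to the context before the next action
```

A production retriever would use embeddings rather than token overlap, but the loop structure (retrieve, append, act) is the same.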
Results or Findings: RAG improved average diagnostic accuracy across every backbone. Relative gains were most notable for smaller models (1-9B: from 46.5% to 55.1%), larger models (30-70B) also improved (67.4% to 72.8%). RAG reduced requests for non-existent tools (i.e. hallucinations), while increasing alignment of imaging orders with clinician trajectories and guideline indications, and maintained disciplined laboratory selection. RAG-equipped agents gathered more evidence before finalization and specified imaging parameters (modality/region) more consistently. Overall, RAG enhanced transparency by surfacing citable guidance throughout the decision chain.
Conclusion: Across seven backbones from 1B to 70B, including both generalist and biomedical-tuned models, guideline-grounded RAG consistently improves diagnostic accuracy and imaging decision behavior, supporting safer, more auditable LLM assistance for ED acute abdominal pain.
Limitations: Work limited to the ER domain, focusing only on four pathologies from a single-centre, English-language dataset. Only open-weight models were explored to respect MIMIC-IV's data use agreement. Prospective, multi-institutional validation and broader symptom coverage are needed.
Funding for this study: This study is part of the HealthyAI project with number KICH3.LTP.20.006 of the research programme KIC which is (partly) financed by the Dutch Research Council (NWO) and with co-funding by Siemens Healthineers.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
MumbleMED: Introducing a Framework for Fine-Tuning Medical Speech Transcription in Radiology Utilising Large Language and Text-to-Speech Models
Sina Warmer, Essen / Germany
Author Block: S. Warmer, A. Idrissi-Yaghir, K. A. Borys, J. Haubold, C. S. Schmidt, Y. Wen, K. Arzideh, F. Nensa, R. Hosch; Essen/DE
Purpose: General-purpose speech recognition models, such as OpenAI's Whisper, struggle with the complex terminology and structure of medical language, which limits their use in radiology. We present MumbleMED, a domain-adapted open-source speech-to-text model fine-tuned using a pipeline that combines large language models (LLMs) and text-to-speech (TTS) to generate high-quality medical training data.
Methods or Background: Synthetic German medical texts were created by randomly sampling structured concepts from ICD, SNOMED CT, and RadLex using Qwen3-235B. These texts were converted to audio using a TTS engine featuring 17 distinct speakers (female=47%) from our institution. The resulting dataset of 13,689 samples (4,530 minutes of total audio) was used to fine-tune OpenAI’s Whisper model (V2-large), resulting in MumbleMED, a German-language medical variant. Model performance was evaluated on a test set of 450 synthetic samples and 97 real radiology report dictations using Word Error Rate (WER) and Character Error Rate (CER), and compared to the unmodified Whisper baseline.
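The WER and CER metrics used for evaluation are both Levenshtein distances over the reference, computed at the word and character level respectively:

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, one substituted word in a four-word reference gives a WER of 0.25.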
Results or Findings: MumbleMED achieved a WER of 3.73% and CER of 1.97% on the synthetic test set, outperforming the baseline Whisper model (WER=20.33%, CER=11.65%). On real radiology reports, MumbleMED (WER=39.78%, CER=17.92%) also outperformed Whisper (WER=70.65%, CER=46.88%), showing strong recognition of medical domain-specific vocabulary and typical terminology of German radiology reports.
Conclusion: MumbleMED shows that LLM- and TTS-based synthetic data can effectively fine-tune speech-to-text models for clinical use. The approach enables more accurate and reliable transcription of radiological dictations, reducing the need for manual correction and supporting faster, streamlined reporting workflows. In addition, this pipeline can be used to fine-tune the model for any language in the medical domain.
Limitations: The limitations of the study are the synthetic training data and limited speaker variations, which may not fully capture spontaneous or accented speech patterns.
Funding for this study: No funding was provided for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study was performed in adherence to all guidelines defined by the approving institutional review board of the investigating hospital. The Institutional Review Board waived written informed consent due to the retrospective nature of the study. Complete anonymization of all data was performed before inclusion in the study.
6 min
Evaluating Large Language Models for FHIR-Compatible Structured Reporting from Kidney Stone CT Reports
Philipp Arnold, Freiburg Im Breisgau / Germany
Author Block: P. Arnold, E. Kotter, J. Jahn; Freiburg Im Breisgau/DE
Purpose: To assess whether large language models (LLMs) can convert free-text kidney stone CT reports into standardized HL7 FHIR Questionnaire format and to compare performance across input styles and model sizes.
Methods or Background: We collected 99 German abdominal CT reports (50 free-text, 49 semi-structured with section headings). A kidney stone FHIR Questionnaire with 33 key fields was derived from a published consensus template. Three locally hosted Qwen models (8B, 14B, 32B parameters) were prompted field by field to generate FHIR QuestionnaireResponses, which were compared with a radiologist-annotated ground truth. Metrics included ground-truth completeness (proportion of fields present in the source report), AI completeness (proportion of ground-truth fields correctly retrieved by the model), and per-field accuracy (exact or semantically equivalent match across all fields).
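A minimal sketch of assembling per-field model answers into a FHIR R4 QuestionnaireResponse; the `linkId` values and questionnaire reference below are hypothetical placeholders, not the 33-field consensus template:

```python
import json

# Hypothetical extractions for two fields of the kidney stone template.
extracted = {
    "stone-present": {"valueBoolean": True},
    "largest-stone-size-mm": {"valueDecimal": 7.0},
}

def to_questionnaire_response(answers: dict, questionnaire: str) -> dict:
    """Assemble a minimal FHIR R4 QuestionnaireResponse from per-field answers."""
    return {
        "resourceType": "QuestionnaireResponse",
        "questionnaire": questionnaire,
        "status": "completed",
        "item": [{"linkId": link_id, "answer": [value]}
                 for link_id, value in answers.items()],
    }

qr = to_questionnaire_response(extracted, "Questionnaire/kidney-stone-ct")
payload = json.dumps(qr)  # ready for a FHIR server
```

Prompting field by field, as in the study, naturally yields exactly this per-`linkId` answer structure.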
Results or Findings: Semi-structured source reports contained more of the expected information (ground-truth completeness 77% vs. 65% for free text). Across all models, both accuracy and AI completeness were higher for semi-structured inputs. Qwen-32B achieved 93% per-field accuracy and 97% AI completeness on semi-structured reports (vs. 82% and 92% on free text). The 14B model reached 91% accuracy and 95% AI completeness (vs. 83%/94%), while the 8B model achieved 83%/95% (vs. 69%/87%).
Conclusion: LLMs can automatically generate FHIR-compliant structured kidney stone CT reports from textual input with high accuracy. Semi-structured reports yield higher accuracy and completeness. The approach supports prospective workflows, where radiologists dictate freely while an LLM drafts a structured report for rapid review, as well as retrospective extraction of structured data from existing reports.
Limitations: Single-center study focused on one exam type; performance on more heterogeneous imaging reports remains to be validated.
Funding for this study: German Research Foundation (DFG) - SFB 1597–499552394.
Hans A. Krebs Programme (University Clinic Freiburg im Breisgau)
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethics Vote register number: FRKS004287
6 min
Evaluating the Accuracy of Privacy-Preserving Large Language Models in Calculating the Spinal Instability Neoplastic Score (SINS)
Li Yi Tammy Chan, Singapore / Singapore
Author Block: L. Y. T. Chan, D. Z. M. Chan, Y. L. Tan, Q. V. Yap, W. Ong, A. Lee, J. H. Tan, N. Kumar, J. T. P. D. Hallinan; Singapore/SG
Purpose: In diagnostic radiology, LLMs can assist in the computation of the Spine Instability Neoplastic Score (SINS), which is a critical tool for assessing spinal metastases. However, the accuracy of LLMs in calculating the SINS based on radiological reports remains under-explored. This study evaluates the accuracy of two institutional privacy-preserving LLMs - Claude 3.5 and Llama 3.1 - in computing the SINS from radiology reports and electronic medical records.
Methods or Background: A retrospective analysis was conducted on 124 radiology reports from patients with spinal metastases. Three expert readers established a reference standard for the SINS calculation. Two orthopaedic surgery residents and two LLMs (Claude 3.5 and Llama 3.1) independently calculated the SINS. The intraclass correlation coefficient (ICC) was used to measure the inter-rater agreement for the total SINS, while Gwet’s Kappa was used to measure the inter-rater agreement for the individual SINS components.
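For a binary SINS component rated by two readers, Gwet's first-order agreement coefficient (AC1) can be computed as below; the rating vectors are illustrative, not study data:

```python
def gwet_ac1(r1: list[int], r2: list[int]) -> float:
    """Gwet's AC1 for two raters and a binary (0/1) category."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n   # observed agreement
    pi = (sum(r1) + sum(r2)) / (2 * n)             # mean prevalence of category 1
    pe = 2 * pi * (1 - pi)                         # chance agreement under AC1
    return (po - pe) / (1 - pe)

ac1 = gwet_ac1([1, 1, 1, 0], [1, 1, 1, 1])  # toy component ratings
```

Unlike Cohen's kappa, AC1 stays well-behaved when one category dominates, which is why it suits individual SINS components with skewed prevalence.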
Results or Findings: Both LLMs and clinicians demonstrated almost perfect agreement with the reference standard for the total SINS. Between the two LLMs, Claude 3.5 (ICC=0.984) outperformed Llama 3.1 (ICC=0.829). Claude 3.5 was also comparable to the clinician readers (ICCs of 0.926 and 0.986), exhibiting near-perfect agreement across all individual SINS components [0.919–0.990].
Conclusion: Claude 3.5 demonstrated high accuracy in calculating the SINS and may serve as a valuable adjunct in clinical workflows, potentially reducing clinician workload while maintaining diagnostic reliability. However, variations in LLM performance highlight the need for further validation and optimisation before clinical integration.
Limitations: Only one prompt strategy was used, and performance may vary with alternative prompting methods. The LLMs were evaluated at a single time point using a single-institution dataset, limiting conclusions about reproducibility and generalisability.
Funding for this study: Direct funding from MOH/NMRC. This research is supported by the Singapore Ministry of Health National Medical Research Council under the NMRC Clinician Innovator Award (CIA). Grant title: Deep learning pipeline for augmented reporting of MRI whole spine (CIAINV23jan-0001, MOH-001405).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Ethical review and approval were waived for this study as it was granted a Domain-Specific Review Board waiver owing to minimal risk
6 min
Empowering radiology training with AI: Large Language Models in resident error detection and feedback
Alberto Kyling, Santiago / Chile
Author Block: A. Kyling, M. Salinas, G. Briceño, C. Pizarro, P. F. Guzman, D. Ladrón de Guevara; Santiago/CL
Purpose: The implementation of artificial intelligence in healthcare is advancing globally. Automated analysis of errors in radiology reports could provide objective and personalized feedback, supporting staff radiologists’ teaching, optimizing resident training, and improving radiological care quality.
Methods or Background: This observational, retrospective study analyzed 213 paired radiology reports (CT, MRI, US) from first- and second-year residents, validated by staff radiologists. Reports were anonymized and processed by an LLM (Gemini-3-Pro) to classify errors (structural, semantic, diagnostic) and to measure textual concordance via cosine similarity of embeddings. A random subset of 40 report pairs (18%) underwent independent senior radiologist validation.
Results or Findings: 213 report pairs plus 40 controls were analyzed. Median cosine similarity was 0.99 for error-free reports and 0.91 for those with ≥5 diagnostic errors; 0.90 served as the discriminant threshold (specificity 100%, sensitivity 13.1%). Spearman correlation between similarity and diagnostic error count was ρ=–0.53 (p<0.001). Controls (deliberately erroneous or incoherent) consistently scored <0.85. Diagnostic errors averaged 2.10±1.43 per report (85.5% had ≥1 error), with omissions/false negatives comprising 67.8%. Triaxial qualitative classification showed poor reliability for error detection (Kappa 0.04 at threshold 0.90) but acceptable reliability for ruling out error-free reports (Kappa 0.36 at threshold 0.99). Manual validation achieved 100% concordance (40/40).
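The cosine-similarity screening with the 0.90 discriminant threshold can be sketched as follows (the embedding vectors here are toy values):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def flag_report(resident_emb, staff_emb, threshold: float = 0.90) -> bool:
    """Below the discriminant threshold, flag the report pair for review."""
    return cosine_similarity(resident_emb, staff_emb) < threshold

needs_review = flag_report([1.0, 0.0], [0.6, 0.8])  # toy 2-D embeddings
```

With real sentence embeddings the vectors have hundreds of dimensions, but the thresholding logic is identical.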
Conclusion: The LLM reliably identified discrepancies and graded severity (ρ=–0.53). While not robust for error detection, it showed acceptable reliability in validating error-free reports, suggesting its potential as an educational support tool, particularly for targeting omission errors which comprised the majority of diagnostic discrepancies.
Limitations: Single-center, limited sample size, risk of hallucinations, and reliance on staff reports as reference standard constrain generalizability, highlighting the need for larger, multicenter, validation studies.
Funding for this study: This research received no external funding and was conducted without dedicated financial support.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Approved by the Ethics Committee of Hospital San Juan de Dios, Santiago, Chile.
6 min
Beyond Evans Index: CT Ventricular Nomograms with 12 Sub-compartments from an LLM-Curated Cohort with nnU-Net Segmentation
Nathan Vishwanathan, Basel / Switzerland
Author Block: N. Vishwanathan, S. Griot, J. Wasserthal, S. Yang, M. Segeroth, J. M. Lieb, M. Bach, M-N. Psychogios, M. A. Mutke; Basel/CH
Purpose: To build age- and sex-specific CT ventricular volume nomograms (12 subcompartments) and show how they can be used in everyday reporting.
Methods or Background: Single-centre retrospective study, 2019–2024. A locally run, German-tuned large language model screened head-CT reports and brief clinical summaries to exclude stroke, mass lesions, relevant white-matter disease, dementia/MCI, or cognitive decline. The final “normal” cohort was 3,086 examinations from 2,964 adults (14–98 years). Ventricular volumes were segmented into 12 subcompartments with an nnU-Net model. We report decade- and sex-stratified values, model performance with 95% CIs, false-discovery correction for multiple tests, and effect sizes.
Results or Findings: LLM labelling accuracy across diagnostic categories was 0.966–0.992; for enlarged ventricles 0.992 (95% CI 0.978–0.997). Segmentation achieved median Dice 0.918 (95% CI 0.910–0.923). Total ventricular volume rose with age, with a clear step from 50–59 to 60–69: +33.9% in females and +49.4% in males. Most subcompartments showed moderate to strong age correlations (r=0.44–0.68), while the fourth ventricle changed little (r=0.02). Male volumes were 8–19% higher than female volumes after correction. We provide percentile nomograms (5th–95th) with 95% CIs and decade means/SDs for each subcompartment and sex.
Conclusion: A simple, on-premises LLM + nnU-Net workflow can curate large CT datasets and produce reliable ventricular nomograms. These charts make reporting more objective: a scan can be flagged when a volume is at or above the 95th percentile for the patient’s age and sex, supporting earlier recognition of disproportionate dilatation and focused clinical follow-up.
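The flagging rule described above can be sketched as follows; the reference volumes in this table are placeholder values, not the published nomogram:

```python
# Placeholder 95th-percentile total ventricular volumes (ml) per decade and
# sex -- illustrative numbers only, not the study's nomogram values.
P95_TOTAL_VOLUME_ML = {
    ("F", 50): 38.0, ("F", 60): 51.0,
    ("M", 50): 45.0, ("M", 60): 67.0,
}

def flag_ventricles(volume_ml: float, sex: str, age: int) -> bool:
    """Flag a scan when volume is at or above the age/sex 95th percentile."""
    decade = min(age // 10 * 10, 90)
    return volume_ml >= P95_TOTAL_VOLUME_ML[(sex, decade)]
```

In a reporting workflow, each of the 12 subcompartment volumes would be checked against its own percentile table in the same way.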
Limitations: Single-centre, retrospective Swiss cohort, limited generalisability; no external validation.
Cross-sectional nomograms; extreme ages (≥90 y) have wide CIs; no longitudinal/test–retest data.
nnU-Net performance validated internally only; no multi-centre benchmark.
Funding for this study: This study received no external funding.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethics waiver was granted.
6 min
Towards speech assistants in reporting workflows: interoperability challenges of structured reporting, speech recognition and large language models
Benedikt Kämpgen, Würzburg / Germany
Author Block: B. Kämpgen1, G. Arnhold2, J. Stöckmann3, I. Schmittel1, F. Jungmann2, D. Feiler3, D. Pinto dos Santos2, P. Mildenberger2, T. Jorg2; 1Würzburg/DE, 2Mainz/DE, 3Munich/DE
Purpose: Speech-based dialogue systems for structured reporting (SR) promise to improve both reporting quality and efficiency (Jorg et al., 2023, https://doi.org/10.1186/s13244-023-01392-y). For studies involving multiple templates, however, the system must be seamlessly integrated into radiologists’ workflows. This reveals interoperability challenges between structured reporting, speech recognition and large language model components.
Methods or Background: We designed an architecture with defined interfaces connecting an open-source structured reporting tool (Dos Santos et al., 2017, https://doi.org/10.1007/s00330-016-4344-0), a commercial speech recognition system (DFC-SYSTEMS), and a commercial large language model (Empolis). The reporting tool is launched from the RIS, where an appropriate SR template (e.g., for urolithiasis) is automatically selected based on the examination. A "speech assistant" button opens a chat window, allowing the user to answer template-guided questions via microphone. At any point, particularly when the system has no further questions, the user may accept or reject the prefilled SR template. This workflow minimises look-away interruptions and maximises efficiency gains in structured reporting.
Results or Findings: The application captures audio from the microphone, performs server-based speech-to-text conversion, and forwards the text to a server for text-to-structure processing. Integration across components is achieved via a unified JSON data model, which stores the iteratively completed template and message context/history. Functionality and integration tests demonstrated full vocabulary mapping across the template, speech recognition, and language model components.
Conclusion: Speech-based structured reporting, supported by large language models, is approaching clinical deployment. Our results offer valuable insights for initiatives addressing workflow integration and interoperability challenges.
Limitations: The study included only automated functionality and integration testing based on fictional and anonymised cases (n > 50), with no clinical assessment of efficiency gains. A clinical study with speech-assistant reporting is still in progress.
Funding for this study: Funding was provided by the Bundesministerium für Bildung und Forschung (BMBF), 2022-2025, grant agreement number: 16SV9045, project KIPA.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Benchmarking Large Language Models for Follow-up Recommendations After Abdominal Ultrasound
Vincenzo Vingiani, Bolzano / Italy
Author Block: V. Vingiani, R. Valletta, N. Cortellini, B. Proner, V. Corato, L. Hoxha, F. Ferro, M. Bonatti; Bolzano/IT
Purpose: To evaluate whether large language models (LLMs) provide consistent, guideline-concordant follow-up recommendations after abdominal ultrasound (US) and to benchmark multiple systems.
Methods or Background: We assembled 200 simulated abdominal US cases covering liver, gallbladder/biliary tree, pancreas, spleen, kidneys, and retroperitoneum. Two expert abdominal radiologists defined the ground-truth management (no imaging follow-up, only US follow-up, or additional diagnostic work-up), including modality and timing when further imaging was indicated. Thirty cases refined prompts with GPT-4o; 170 were held out for evaluation. Seven LLMs (GPT-5, GPT-4o, GPT-4o mini, Gemini 2.5 Flash, Gemini 2.5 Pro, Claude 4 Sonnet, and DeepSeek-V3) were tested in zero-shot mode through Firefox with cache cleared and sessions restarted before each query. Stability was assessed across five independent runs. Agreement with the consensus was quantified using Cohen’s κ (unweighted and weighted) and F1 scores.
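The unweighted Cohen's κ used to quantify agreement with the consensus can be computed as follows, shown here on toy management-category labels:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Unweighted Cohen's kappa between two label sequences."""
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of each rater's marginal frequencies.
    pe = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

consensus = ["no-fu", "us-fu", "workup", "no-fu"]   # toy ground-truth labels
model    = ["no-fu", "us-fu", "workup", "us-fu"]    # toy LLM outputs
kappa = cohens_kappa(consensus, model)
```

The weighted variant reported in the abstract additionally penalises disagreements by their distance between ordered categories.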
Results or Findings: GPT-5 achieved the highest accuracy for the three management categories (no imaging follow-up, only US follow-up, or additional diagnostic work-up), 0.988; the lowest was DeepSeek-V3 with 0.829. Weighted κ ranged from 0.755 (DeepSeek-V3) to 0.964 (GPT-5). F1 scores mirrored this pattern, with GPT-5 at 0.988, followed by Gemini 2.5 Pro (0.954) and Gemini 2.5 Flash (0.948). Management-decision stability across five runs was high for all models (0.928–0.968). When further imaging was required, correct selection of the second-level modality and timing was highest for GPT-5 (accuracy 0.779) and lowest for Gemini 2.5 Flash (0.609).
Conclusion: LLMs can translate abdominal US reports into actionable, guideline-aligned follow-up recommendations. GPT-5 performed best overall, supporting the role of LLMs as adjunctive decision support to standardise post-ultrasound imaging decisions across healthcare settings.
Limitations: Synthetic, text-only cases may limit generalizability; zero-shot, browser-based testing may reduce performance in optimised deployments; evolving model versions may affect reproducibility and external validity.
Funding for this study: None
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Feasibility of using a Large Language Model for automated extraction of clinically relevant findings from whole-body MRI reports
Luca Di Palma, Milan / Italy
Author Block: L. Di Palma, M. Alì, A. Lad, G. D'Anna, F. Darvizeh, I. Castiglioni, D. Fazzini; Milan/IT
Purpose: To evaluate the feasibility and accuracy of using a large language model (LLM) to extract and structure clinically relevant information from free-text whole-body MRI (WB-MRI) reports.
Methods or Background: This study included 327 WB-MRI reports from a preventive health screening program. Reports were processed with the DeepSeek-R1-Llama3.3 LLM to extract findings classified according to the ONCO-RADS system, including their anatomical locations. Only ONCO-RADS ≥3 findings were analyzed, as they represent suspicious or actionable abnormalities; ONCO-RADS 1–2 were excluded. LLM outputs were compared with original reports, independently reviewed and annotated by three subspecialist radiologists (neuroradiology, musculoskeletal, body imaging; >5 years’ experience). Radiologists validated whether extracted ONCO-RADS scores and locations matched the reports. Discrepancies were categorized as: (1) missing findings, (2) localization errors (minor: ovary vs uterus; pleura vs lung; major: different organ), and (3) false positives.
Results or Findings: Out of 4,902 total findings, radiologists identified 237 as ONCO-RADS ≥3. Among these, 232 (97.9%) were categorized as ONCO-RADS 3, 3 (1.3%) as ONCO-RADS 4, and 2 (0.8%) as ONCO-RADS 5. The LLM accurately extracted 207 of these cases (87.3%) with full agreement in both classification and location. There were 30 discrepancies (12.7%), comprising 17 missed findings (7.2%) and 13 localization errors (5.5%). Of the localization errors, 11 were minor, while 2 were considered major. Additionally, the LLM reported 16 false positives.
Conclusion: This study shows that LLMs can accurately extract clinically relevant findings from free-text WB-MRI reports, with high concordance and minimal clinically significant errors. This suggests strong potential for LLMs in supporting report structuring in radiology.
Limitations: Use of a single LLM for ONCO-RADS extraction may limit generalizability.
Funding for this study: C.D.I. Ricerca Innovazione e Sviluppo
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethics committee approval was requested (study ID: 6309)
6 min
The utility of “thinking” in CAD-RADS scoring of an open-source hybrid-reasoning large language model
Lennart Roelof Koetzier, Utrecht / Netherlands
Author Block: V. Sandfort1, D. Vigneault1, L. R. Koetzier1, M. J. Willemink2, J. Wu2, R. Hallett1, K. Nieman1, D. Fleischmann1, D. Mastrodicasa3; 1Stanford, CA/US, 2Palo Alto, CA/US, 3Seattle, WA/US
Purpose: Large language models (LLMs) are developed to answer questions or follow instructions. While earlier generation LLMs gave quick, general responses, they could not perform multi-step reasoning. Reinforcement learning has enabled “reasoning” models, such as Qwen3-235B, which operate in both “thinking” and “non-thinking” modes. Although LLMs have been evaluated for extracting information from radiology reports, the effect of “thinking” on this task remains unclear. We evaluated the effect of “thinking” in Qwen3-235B on the performance of determining CAD-RADS scores from cardiac CT-reports.
Methods or Background: We retrospectively included 500 de-identified cardiac CT-reports from four hospitals across three USA regions using an online platform (Segmed). CAD-RADS categories were determined in consensus by three cardiovascular imaging experts. Qwen3-235B was run in fp8-quantization via API (together.ai) in both “thinking” and “non-thinking” mode. Thinking was measured in thinking-characters between thinking-tags, divided into quintiles (Q1=least thinking; Q5=most thinking). Model performance was assessed using unweighted Cohen’s kappa.
Results or Findings: Model performance in “thinking” mode was numerically higher than “non-thinking” mode (0.791 [0.751-0.835] vs 0.732 [0.686-0.777], respectively). In “thinking” mode, best performance was seen in Q1 (kappa=0.893) and declined with more thinking (Q5-kappa=0.452). In “non-thinking” mode, performance was highest in Q1 (kappa=0.944) and Q2 (kappa=0.901), but lowest in Q5 (kappa=0.365).
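The quintile assignment by thinking-character count can be sketched with the standard library (the character counts below are toy values):

```python
import statistics

def quintile_labels(thinking_chars: list[int]) -> list[int]:
    """Assign each sample a quintile (1=least thinking, 5=most) by char count."""
    cuts = statistics.quantiles(thinking_chars, n=5)  # four cut points
    return [1 + sum(x > c for c in cuts) for x in thinking_chars]

labels = quintile_labels(list(range(100)))  # toy thinking-character counts
```

Per-quintile kappa is then computed over the reports falling in each label, which is how the Q1 versus Q5 contrast above was obtained.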
Conclusion: An open-source hybrid-reasoning LLM accurately determined CAD-RADS scores from cardiac CT-reports. Very long “thinking” (Q5) was associated with poor performance, suggesting it may serve as a model-confidence indicator. “Non-thinking” worked better for easy cases, while “thinking” was advantageous in difficult cases.
Limitations: Qwen3-235B is not intended for medical use, and our study only evaluated performance.
Funding for this study: None
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: