Research Presentation Session: Imaging Informatics and Artificial Intelligence

RPS 1505 - Generative AI in radiology

February 28, 14:00 - 15:30 CET

7 min
Precision, Non-classifications, and Misclassifications of General and Medical Large Language Models in Liver Lesions Classification using LI-RADS from Unstructured Radiology Reports
Wan Hang Keith Chiu, Hong Kong / Hong Kong SAR China
Author Block: J. Lu1, F. F-Y. Tang1, J. Ng1, C. Chan2, H. M. Cheng1, P. L. H. Yu1, W. K. W. Seto1, W. H. K. Chiu1; 1Hong Kong/HK, 2Hampshire/UK
Purpose: Large Language Models (LLMs) are powerful tools for data extraction and summarization. However, scant evidence exists as to whether a medical domain-specific LLM is necessary to perform radiology tasks. This study compares the performance of a general and a medical LLM in extracting and categorizing liver lesions from radiology reports according to the Liver Imaging Reporting and Data System (LI-RADS).
Methods or Background: A total of 273 anonymized, unstructured Computed Tomography (CT) reports, written by 115 radiologists from 5 institutions and containing 599 liver observations, were retrospectively collected. These reports were fed into GPT-4 and MedLM to assign a LI-RADS category to each observation, using simple zero-shot prompts (GPT-4sp and MedLMsp) and instructions refined through prompt engineering (GPT-4pe and MedLMpe). Ground truth labels and the quality of the CT reports were determined by 2 board-certified radiologists.
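As an illustration of the zero-shot setup, the sketch below sends a report to a chat-style model with a single LI-RADS instruction. The prompt wording, the OpenAI client usage, and the model name are illustrative assumptions, not the study's actual prompts or pipeline.

```python
# Hypothetical sketch of a zero-shot LI-RADS classification prompt.
# The instruction text and model name are placeholders, not the study's materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ZERO_SHOT_PROMPT = (
    "You are a radiology assistant. Read the CT report below and assign a "
    "LI-RADS category (LR-1, LR-2, LR-3, LR-4, LR-5, LR-M, or LR-TIV) to each "
    "liver observation. Answer as a numbered list of 'observation: category'."
)

def classify_report(report_text: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": ZERO_SHOT_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content
```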
Results or Findings: At the lesion level, the accuracies for correctly classifying malignant lesions (LR-4/5/M) were 0.584, 0.634, 0.668, and 0.840 for GPT-4sp, MedLMsp, GPT-4pe, and MedLMpe, respectively, with MedLM outperforming GPT-4 using both simple prompts (p=0.023) and prompt engineering (p<0.001). At the patient level, the accuracies were 0.762, 0.744, 0.791, and 0.883, respectively, with prompt engineering outperforming simple prompts in MedLM (p<0.001). Prompt engineering improved performance by reducing non-classifications in both MedLM (11.5% vs 33.7%, p<0.001) and GPT-4 (29.4% vs 38.4%, p<0.001). The quality of the CT reports of the 31 misclassified/non-classified patients on MedLMpe was considered average (median Likert score 3/5), with a Fleiss' κ of 0.563 (95% CI 0.356-0.770).
Conclusion: While the general LLM exhibited potential in text-based medical tasks, our findings suggest that the medical LLM yields superior performance.
Limitations: Limitations include the small sample size, limited exploration of prompt engineering, and the evaluation of only one general and one medical LLM.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The ethics committee notification can be found under Ref: KC/KE-23-0083/ER-3.
7 min
Evaluating the Performance of LLaMA 3.1 in Classifying Mammography Reports Based on BIRADS Scores
Amit Kumar, New Delhi / India
Author Block: A. Kumar, V. K. Venugopal; New Delhi/IN
Purpose: This study aimed to evaluate the performance of the LLaMA 3.1 large language model (LLM) in classifying mammography reports based on the Breast Imaging Reporting and Data System (BI-RADS) classification without fine-tuning the model.
Methods or Background: A total of 930 mammography reports, covering BI-RADS categories 0 to 6, were processed using the open-source LLaMA 3.1 LLM (8B version). The model was prompted using a five-shot prompting technique. The classification accuracy of the model was analyzed, and classification errors were recorded. Among the 930 reports, 8 instances of errors were identified, including 4 cases in which BI-RADS 2 was incorrectly classified as BI-RADS 4 by the model.
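To make the few-shot setup concrete, the sketch below shows one way a five-shot BI-RADS prompt could be assembled and sent to a locally served LLaMA 3.1 8B model via the Ollama Python client. The example reports, prompt wording, and model tag are placeholders rather than the study's materials.

```python
# Hypothetical sketch: five-shot prompting of a locally served LLaMA 3.1 8B model.
# The worked examples below are placeholders, not real study data.
import ollama

FEW_SHOT_EXAMPLES = [
    ("Scattered fibroglandular densities, no suspicious findings.", "BI-RADS 1"),
    ("Stable, benign-appearing calcified fibroadenoma.", "BI-RADS 2"),
    # ... three further examples would complete the five-shot prompt ...
]

def build_messages(report: str) -> list[dict]:
    messages = [{"role": "system",
                 "content": "Assign a BI-RADS category (0-6) to the mammography "
                            "report. Answer with the category only."}]
    for example_report, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_report})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": report})
    return messages

def classify(report: str) -> str:
    response = ollama.chat(model="llama3.1:8b", messages=build_messages(report))
    return response["message"]["content"].strip()
```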
Results or Findings: The LLaMA 3.1 model correctly classified 921 of the 930 reports, yielding an overall accuracy rate of 98.99%. Despite the model's strong performance, errors were present, particularly the misclassification of lower BI-RADS categories, with some benign reports (BI-RADS 2) being classified at a higher risk level (BI-RADS 4).
Conclusion: LLaMA 3.1, even without fine-tuning, shows significant potential for accurately classifying mammography reports based on BI-RADS scoring. This indicates that large language models could serve as valuable tools in medical imaging analysis, offering high accuracy with minimal adjustment.
Limitations: The study is limited by the occurrence of misclassification in a small number of cases, particularly in distinguishing between benign and higher-risk categories. Further studies with larger datasets and fine-tuning may be needed to improve reliability.
Funding for this study: The study did not receive any funding
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Anonymized data was used
7 min
Evaluating local open-source large language models for data extraction from unstructured reports on mechanical thrombectomy in patients with ischemic stroke
Aymen Meddeb, Berlin / Germany
Author Block: A. Meddeb1, A. Othman2, N. F. Grauhan2, M. Scheel3, J. Nawabi3; 1Reims/FR, 2Mainz/DE, 3Berlin/DE
Purpose: To assess the effectiveness of open-source Large Language Models (LLMs) in extracting clinical data from unstructured mechanical thrombectomy reports in patients with ischemic stroke caused by a vessel occlusion.
Methods or Background: We deployed local open-source LLMs to extract data points from free-text procedural reports of patients who underwent mechanical thrombectomy between September 2020 and June 2023 at our institution. The external dataset was obtained from a second university hospital and comprised consecutive cases treated between September 2023 and March 2024. Ground truth labeling was facilitated by a human-in-the-loop (HITL) approach, with time metrics recorded for both automated and manual data extraction. We tested three models (Mixtral, Qwen, and BioMistral), assessing their performance in terms of precision, recall, and F1 score across 15 clinical categories, such as National Institutes of Health Stroke Scale (NIHSS) scores, occluded vessels, and medication details.
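A minimal sketch of the extraction step is shown below, assuming the models are served locally via Ollama and asked to return JSON. The field names cover only a subset of the 15 categories, and the prompt and model tag are illustrative, not the study's.

```python
# Hypothetical sketch of structured data extraction with a locally served model.
# Field names and the prompt are illustrative placeholders.
import json
import ollama

FIELDS = ["nihss_admission", "occluded_vessel", "first_series_time", "medication"]

PROMPT_TEMPLATE = (
    "Extract the following fields from the thrombectomy report as JSON with "
    "exactly these keys (use null if a field is not mentioned): {fields}\n\n"
    "Report:\n{report}"
)

def extract_fields(report_text: str, model: str = "mixtral") -> dict:
    response = ollama.chat(
        model=model,
        format="json",  # constrain the output to valid JSON
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(fields=FIELDS,
                                                     report=report_text)}],
    )
    return json.loads(response["message"]["content"])
```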
Results or Findings: The study included 1,000 consecutive reports from our primary institution and 50 reports from a secondary institution. Mixtral showed the highest precision, achieving 0.99 for first series time extraction and 0.69 for occluded vessel identification within the internal dataset. In the external dataset, precision ranged from 1.00 for NIHSS scores to 0.70 for occluded vessels. The HITL approach yielded an average time saving of 65.6% per case, ranging from 45.95% to 79.56%.
Conclusion: LLMs showed high performance in automated clinical data extraction from medical reports. Incorporating HITL annotations enhances precision and also ensures the reliability of the extracted data. This methodology presents a scalable privacy-preserving option that can significantly support clinical documentation and research endeavors.
Limitations: Variability in the quality and consistency of the input data, such as differences in terminology, formatting, or level of detail in the reports, can affect the performance of the models.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This retrospective study was approved by the ethics committee of the Charité University Hospital in Berlin (No. EA4/062/20). The requirement for informed consent was waived due to the retrospective design of the study.
7 min
Large language models in healthcare: DRAGON performance benchmark for clinical NLP
Joeran Sander Bosma, Nijmegen / Netherlands
Author Block: J. S. Bosma1, K. Dercksen1, M. De Rooij1, F. Ciompi1, A. Hering1, J. Geerdink2, H. Huisman1; 1Nijmegen/NL, 2Almelo/NL
Purpose: Artificial Intelligence (AI) requires large-scale annotated datasets to train clinical algorithms to perform at an expert level. Natural Language Processing (NLP) shows great potential to annotate large volumes of routine clinical data and facilitate the training of these algorithms. This study aims to introduce a benchmark for clinical NLP algorithms, including Large Language Models (LLMs), to assess their ability to extract information from medical reports.
Methods or Background: The DRAGON (Diagnostic Report Analysis: General Optimization of NLP) challenge has three objectives. First, it provides a unique and publicly available cloud-based benchmark for clinical NLP that spans 28 clinically relevant tasks. It uses 28,824 annotated medical reports from five Dutch care centers, covering multiple modalities (MRI, CT, X-ray, histopathology) and conditions spanning the entire body (lungs, pancreas, prostate, skin, etc.). The tasks are designed to facilitate automated dataset curation and include predicting diagnoses, extracting lesion sizes, identifying protected health information, and more. Second, we release foundational LLMs pretrained on four million clinical reports from a sixth Dutch care center. Third, we investigate three pretraining strategies across five architectures by evaluating the LLMs using the DRAGON benchmark.
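As an illustration of what domain-specific pretraining involves, the sketch below continues masked-language-model training of a general-domain encoder on a corpus of clinical reports using Hugging Face Transformers. The starting model, file name, and hyperparameters are placeholders and do not reflect the challenge's released models.

```python
# Hypothetical sketch of domain-specific pretraining: continuing masked-language-model
# training on clinical reports before fine-tuning on downstream benchmark tasks.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "xlm-roberta-base"  # placeholder general-domain starting point
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# "clinical_reports.txt" is a placeholder for an in-house report corpus.
reports = load_dataset("text", data_files={"train": "clinical_reports.txt"})["train"]
tokenized = reports.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clinical-mlm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```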
Results or Findings: Results showed the superiority of domain-specific pretraining (benchmark score of 0.770, 95% CI 0.755-0.785) and mixed-domain pretraining (0.756, 95% CI 0.739-0.773), compared to general-domain pretraining (0.734, 95% CI 0.717-0.752, p<0.005). The best model achieved excellent or good performance for 18/28 tasks and poor or moderate performance for 10/28 tasks.
Conclusion: The DRAGON benchmark showed that NLP is ready to facilitate data curation in some settings, enabling high-quality, low-cost, and large-scale annotation, and uncovered where innovations are needed to improve clinical NLP.
Limitations: Half of the tasks were sourced from a single academic tertiary care center (14/28, 50%).
Funding for this study: Funding was provided by Health~Holland (LSHM20103), European Union HORIZON-HLTH-2022: COMFORT (101079894), European Union HORIZON-2020: ProCAncer-I project (952159), European Union HORIZON-2020: PANCAIM project (101016851), and NWO-VIDI grant (number 18388). The collaboration project is co-funded by PPP Allowance awarded by Health~Holland, Top Sector Life Sciences & Health, to stimulate public-private partnerships. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Health and Digital Executive Agency (HADEA). Neither the European Union nor the granting authority can be held responsible for them.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Retrospective use of anonymous patient data was approved by institutional or regional review boards at each contributing center (identifiers: CMO 2016-3045; IRBd22-159; A21-0349 2; A20-0777), and was conducted in accordance with the principles of the Declaration of Helsinki. Informed consent was waived.
7 min
Implementing Local Large Language Models and Using a Clinical Data Warehouse for Clinical Summarization and Decision Support
Martin Segeroth, Basel / Switzerland
Author Block: M. Segeroth, M. Bach, J. Wasserthal, J. Cyriac, M. Pradella, C. Breit, B. Stieltjes, E. M. Merkle, S. Yang; Basel/CH
Purpose: Recent advances in Large Language Models (LLMs) have improved medical text summarization and decision support but raised data privacy concerns. We aim to integrate local LLMs into clinical workflows for testing with real-world patient data.
Methods or Background: Within our institutional healthcare network, a clinical data warehouse (CDWH) serves as a central hub for querying all patient records and parameters while ensuring data privacy. Exemplary parameters, such as the temporal evolution of chemotherapies, dates and outcomes of resections, and findings from previous imaging examinations, were extracted for oncology patients. The collected data were fed via a prompt into local LLMs. We utilized privateGPT and Ollama as the primary platforms, allowing the integration of clinical treatment guidelines. As LLMs, we tested Llama3-70B and the German-language SauerkrautLM Mixtral 8X7B Instruct, both of which ran on an NVIDIA A100 GPU with 80 GB of memory. A set of anonymized data was processed with the cloud-based ChatGPT-4 and Claude-3 for comparison.
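To illustrate how warehouse-derived parameters can be turned into a prompt for a local model, here is a minimal sketch against Ollama's REST endpoint. The field names, model tag, and prompt wording are assumptions and do not reproduce the privateGPT setup described above.

```python
# Hypothetical sketch: parameters queried from the clinical data warehouse are
# formatted into a summarization prompt and sent to a locally served model.
import requests

def summarize_patient(cdw_record: dict, model: str = "llama3:70b") -> str:
    prompt = (
        "Summarize the oncological history of this patient for a tumor board:\n"
        f"- Chemotherapies: {cdw_record.get('chemotherapies')}\n"
        f"- Resections: {cdw_record.get('resections')}\n"
        f"- Prior imaging findings: {cdw_record.get('imaging_findings')}\n"
    )
    response = requests.post(
        "http://localhost:11434/api/generate",  # local endpoint, no data leaves the network
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    return response.json()["response"]
```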
Results or Findings: Using the privateGPT platform, both tested LLMs ran on a single GPU with a maximum memory usage of 65 GB. Both LLMs created text summaries within 15 seconds and provided decision support in under 5 seconds per request. For all brain cancer cases, the local LLMs provided a correct and reasonable summary of the medical history. In decision-making for a prostate tumor board, the decision accuracy amounted to 7 out of 10 test cases. For anonymized data, accuracy between the local LLMs and both ChatGPT-4 and Claude-3 was 8 out of 10 test cases.
Conclusion: Integration of local LLMs into clinical workflows or research tasks is possible. The local LLMs were able to summarize medical history and clinical data for tumor boards while complying with local data privacy policies.
Limitations: Only two local LLMs were evaluated on sophisticated datasets.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: None
7 min
Training and Evaluation of Sentence Transformer Model for Retrieval Augmented Generation on Radiology Reports
Kamyar Arzideh, Essen / Germany
Author Block: K. Arzideh, H. Schäfer, A. Idrissi-Yaghir, C. S. Schmidt, J. Haubold, R. Hosch, F. Nensa; Essen/DE
Purpose: In many medical settings, physicians often have to sift through unstructured documents to find important information. This manual process is time-consuming and can lead to missed details. Retrieval Augmented Generation (RAG) can help physicians quickly locate relevant information. By using sentence transformer models fine-tuned for retrieval tasks, similarity search between an input query and document passages can be performed to find relevant context. However, most publicly available models are not specifically fine-tuned for the radiology domain and are therefore very limited in finding clinically relevant information.
Methods or Background: Document chunks from 400,000 German clinical notes, including radiology reports and doctors' notes, were provided as input to the SauerkrautLM-SOLAR-Instruct Large Language Model. The model was prompted to generate clinically related questions and answers based on these chunks.

The model generated 11 million clinically related question-answer pairs to fine-tune a multilingual-e5-large model. For evaluation, 1,717 question-answer pairs were generated from 215 radiology reports. A radiologist filtered out unrelated or incorrect pairs for a realistic evaluation. The fine-tuned model was then integrated into a RAG system, and its answers were compared to those from a non-fine-tuned model using the same dataset.
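A minimal sketch of the fine-tuning recipe is shown below, assuming the sentence-transformers library with an in-batch-negatives loss and the e5 "query:"/"passage:" input prefixes. The loss choice, hyperparameters, and example pair are assumptions, as the abstract does not specify them.

```python
# Hypothetical sketch: fine-tuning a multilingual-e5-large retriever on synthetic
# question-passage pairs. The pair below is a placeholder; the real training set
# consisted of roughly 11 million generated pairs.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

qa_pairs = [
    ("Welche Lappen sind von der Pneumonie betroffen?",
     "Befund: Infiltrate im rechten Unterlappen, vereinbar mit Pneumonie."),
    # ... millions of synthetic (question, report chunk) pairs in practice ...
]

model = SentenceTransformer("intfloat/multilingual-e5-large")
train_examples = [InputExample(texts=["query: " + q, "passage: " + p])
                  for q, p in qa_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```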
Results or Findings: Fine-tuning the model resulted in improved performance metrics. The BLEURT score increased from 0.551 to 0.563, indicating enhanced alignment with human judgment. Similarly, the BERTScore F1 rose from 0.750 to 0.756.
Conclusion: By using an LLM to generate synthetic questions from real-world documents and fine-tuning sentence transformer models on these question-document pairs, information retrieval performance can be improved, as indicated by automated evaluation metrics.
Limitations: The evaluation was only carried out for documents in German. Fine-tuning on documents written in other languages and from other hospital sites could lead to broader applicability.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study was approved by the Ethics Committee of the Medical Faculty of the University of Duisburg-Essen (approval number 23-11557-BO). Due to the study's retrospective nature, the requirement of written informed consent was waived by the Ethics Committee of the Medical Faculty of the University of Duisburg-Essen. All methods were carried out in accordance with relevant guidelines and regulations.
7 min
Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis
Elif Can, Freiburg Im Breisgau / Germany
Author Block: E. Can1, W. Uller1, K. Vogt1, F. Busch2, N. Bayerl3, A. Kader2, M. R. Makowski2, K. K. Bressem2, L. C. Adams2; 1Freiburg/DE, 2Munich/DE, 3Erlangen/DE
Purpose: To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7b and Mistral-8x7b), in simplifying 109 interventional radiology reports.
Methods or Background: Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, and naturalness, as well as error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for analysis.
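The readability metrics can be computed with off-the-shelf tooling; the sketch below uses the textstat package as one common option, since the abstract does not name the tool actually used.

```python
# Hypothetical sketch of the quantitative readability scoring for one simplified
# report. textstat is an assumed tooling choice, not necessarily the study's.
import textstat

def readability_scores(simplified_report: str) -> dict:
    return {
        "FRE": textstat.flesch_reading_ease(simplified_report),
        "FKGL": textstat.flesch_kincaid_grade(simplified_report),
        "SMOG": textstat.smog_index(simplified_report),
        "DCRS": textstat.dale_chall_readability_score(simplified_report),
    }
```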
Results or Findings: Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus for any metrics (all Bonferroni-corrected p-values: p=1), while they outperformed other models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. All models exhibited some trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8x7B the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus in readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).
Conclusion: GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified interventional radiology reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.
Limitations: This study was based on predefined metrics, which, while comprehensive, may not capture all aspects of patient understanding and engagement. Future research should include real-world data, a broader range of medical documents, and consider patient feedback to more accurately assess the clinical utility of these models.
Funding for this study: This study did not receive any specific funding from public, commercial, or not-for-profit sectors.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Since the reports did not include any real patient data, institutional review board approval was not required.

This ensures that the study adhered to ethical standards by avoiding the use of real patient information and thereby eliminating the need for formal ethical approval processes typically required for studies involving human subjects.
7 min
Automated Radiology Controlling - Using Large Language Models for Prediction of Radiological Services based on Radiological Reports
Kamyar Arzideh, Essen / Germany
Author Block: K. Arzideh, A. Idrissi-Yaghir, H. Schäfer, K. A. Borys, J. Haubold, F. Nensa, R. Hosch; Essen/DE
Purpose: In hospitals worldwide, the controlling of radiological services is a manual process. In Germany, the so-called "Gebührenordnung für Ärzte" (GOÄ) regulates the billing of private medical or dental services, i.e. services outside the public health insurance scheme. GOÄ codes indicate which privately billed clinical interventions were performed during treatment. These codes are documented by going through radiology reports and picking out the relevant information, which is time-consuming and error-prone.
Methods or Background: To automate the billing of radiological services, a Large Language Model (LLM) was fine-tuned to generate GOÄ codes from radiology reports. In total, 1,000,000 pairs of radiology reports and GOÄ codes were split into an 80% training and a 20% test dataset. Training was performed on a Phi-3-small-8k-instruct model. For evaluation, the codes generated by the model were compared against the test dataset.
Results or Findings: The fine-tuned LLM achieved an exact-match accuracy of 75% for the generation of GOÄ codes, i.e. the generated codes were identical to the ground truth. In addition, 83% of the predicted codes were present in the ground truth but did not necessarily constitute a complete match.
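The two figures reflect different matching criteria: exact match of the full code set versus the share of predicted codes found in the ground truth. The sketch below shows one plausible way to compute them; the abstract does not specify the exact computation, so variable names and logic are illustrative only.

```python
# Hypothetical sketch of the two reported metrics for GOÄ code generation.
def evaluate(predicted: list[set[str]], ground_truth: list[set[str]]):
    # Exact-match accuracy: the generated code set equals the ground-truth set.
    exact_matches = sum(p == g for p, g in zip(predicted, ground_truth))
    exact_match_accuracy = exact_matches / len(ground_truth)

    # Fraction of predicted codes that appear in the ground truth (partial credit).
    total_predicted = sum(len(p) for p in predicted)
    codes_in_gt = sum(len(p & g) for p, g in zip(predicted, ground_truth))
    code_precision = codes_in_gt / total_predicted if total_predicted else 0.0

    return exact_match_accuracy, code_precision
```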
Conclusion: LLMs are capable of automatically extracting relevant controlling codes based on radiology reports alone. Therefore, LLMs could be used as an enhanced method for the automation of controlling tasks in radiology.
Limitations: The LLM needs human feedback and manual correction in order to achieve human-like results. The radiology reports used in this study were also written in German. The use of datasets in other languages and from other hospitals could enable broader generalizability.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study adhered to all guidelines defined by the approving institutional review board of the investigating hospital. The Institutional Review Board waived written informed consent due to the study's retrospective nature. Complete anonymization of all data was performed before inclusion in the study.
7 min
Insights and Challenges in Implementing Vision Transformers for Thorax Radiography
Sardi Hyska, Munich / Germany
Author Block: S. Hyska, A. Wollek, T. Lasser, M. Ingrisch, B. O. T. Sabel; Munich/DE
Purpose: This study aimed to evaluate the performance of a Vision Transformer (ViT)-based AI model, trained on publicly available chest radiography datasets, when applied to real-world data from our clinic. The model's performance in detecting pleural effusion, pneumothorax, cardiomegaly, and consolidation was examined, along with potential confounders.
Methods or Background: The AI model, pre-trained on ImageNet and fine-tuned on >700,000 public chest X-rays (CXR), was tested on an internal dataset of 113 CXRs, including 23 pneumothorax, 29 cardiomegaly, 31 consolidation, and 52 pleural effusion cases, as well as 29 normal CXRs. The model's performance was assessed through ROC curves, AUC, the Youden index, and sensitivity/specificity metrics. Logistic regression, odds ratios, and Fisher's test were used to analyse confounding factors.
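For reference, the threshold analysis described above can be sketched with scikit-learn as follows, selecting the operating point that maximizes the Youden index; y_true and y_score are placeholders for per-finding labels and model output probabilities.

```python
# Hypothetical sketch: AUC from the ROC curve and the operating point maximizing
# the Youden index (sensitivity + specificity - 1) for one finding.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def youden_operating_point(y_true, y_score) -> dict:
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                      # Youden index at each candidate threshold
    best = int(np.argmax(j))
    return {"auc": auc,
            "threshold": thresholds[best],
            "sensitivity": tpr[best],
            "specificity": 1 - fpr[best]}
```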
Results or Findings: The model correctly identified all normal CXRs. For pleural effusion, sensitivity was 96.2% and specificity 98.4%, indicating strong performance. For pneumothorax, sensitivity was only 26.1%, with 96.7% specificity; pneumothorax size and the presence of thoracic tubes were significant confounders. Cardiomegaly was detected with 55.2% sensitivity and 96.4% specificity, with concomitant pleural effusions obscuring the heart contours acting as a potential confounder. Consolidation was detected with 45.2% sensitivity and 91.5% specificity, and higher-density consolidations were more easily identified.
Conclusion: This study emphasizes the challenges AI models face when integrated into clinical practice, demonstrating the importance of carefully and clinically assessing model performance on real-world data, especially in the context of confounding factors. While our ViT model showed strong performance for pleural effusion and normal findings, its detection of pneumothorax, cardiomegaly, and consolidation was limited. Known confounders, e.g. pneumothorax size and the presence of thoracic tubes, were confirmed, and new ones, such as pleural effusion in cardiomegaly and the density of consolidations, were identified.
Limitations: The exploratory nature and limited number of CXR were key limitations.
Funding for this study: This work was funded in part by the German Federal Ministry of Health's program for digital innovations for the improvement of patient-centered care in healthcare [grant agreement no. 2520DAT920].
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Approval by an ethics committee is present.