Research Presentation Session: Artificial Intelligence & Machine Learning & Imaging Informatics

RPS 1605 - Exploring the frontiers of AI-enhanced radiology reporting

March 1, 16:00 - 17:30 CET

7 min
Utilising Chat-GPT4 for conversion of free-text head and neck cancer CT reports into structured reports
Amit Gupta, Ansari Nagar / India
Author Block: A. Gupta, K. Rangarajan, A. Garg; New Delhi/IN
Purpose: The purpose of this study was to assess the performance of generative pre-trained transformer 4 (GPT-4) for conversion of free-text computed tomography (CT) reports of head and neck cancer (HNCa) patients into structured reports using a predefined template.
Methods or Background: We retrieved 50 CT reports of HNCa patients from our department. A structured CT report template for HNCa was prepared enumerating various anatomical sites and their respective subsites. Other key imaging findings were also included - status of cervical lymph nodes, airway compromise and involvement of other neck structures and vessels. In the chat portal of GPT-4, the prompt with best results for structured report generation was selected after prompt engineering. Generated structured reports were evaluated by a radiologist by recording the number of places featuring missing information, misinterpreted information and any additional information not present in the actual report. The reporting template was then modified to explicitly incorporate the areas of mistakes and new GPT-4 responses were recorded.
Results or Findings: GPT-4 successfully converted all 50 free-text reports into structured reports. There were ten places with missing information: tracheostomy tube (n=3), non-inclusion of sternocleidomastoid in strap muscles (n=2), extranodal tumour extension (n=3) and contiguous involvement of neck structures by nodal mass rather than the primary tumour (n=2). Four pieces of information were misinterpreted: abbreviations (n=2) and non-suspicious lung nodules regarded as distant metastases (n=2). GPT-4 did not indicate any additional findings. Upon the appropriate incorporation of missing areas in the reporting template and repeating the prompts, GPT-4 rectified all the reports with no repeated or additional mistakes.
Conclusion: The GPT-4 model can be used to structure free-text radiology reports using plain language prompts and a simple yet comprehensive reporting template.
Limitations: Fine-tuning using the GPT-4 application programming interface (API) was not done in our study.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The Institutional Ethics Committee approved this study.
7 min
Enhancing radiology reporting efficiency through structured reports: a quantitative analysis
Paweł Pawel Paczuski, Warsaw / Poland
Author Block: P. Bombinski1, P. P. Paczuski1, K. Paczuski1, B. Duranc1, A. Kusak2; 1Warsaw/PL, 2Lodz/PL
Purpose: This study explores the impact of structured reporting on radiologists' efficiency, standardisation, and clinician comprehension. We propose and analyse key metrics to quantify the acceleration of report creation using predefined templates and trigger mechanisms.
Methods or Background: Structured reports apply checklist-driven templates for standardised radiological reporting. These templates comprise a checklist of observations and predefined triggers, ensuring systematic reporting. Radiologists can click on checklist items, or trigger larger report segments, such as "norm" for healthy examinations, thereby reducing the need for free text input. Structured reports can be generated using a keyboard, mouse, or voice dictation and commands.
Results or Findings: Our results were based on 10,000 reports of various radiological examinations performed by 20 radiologists. Our proposed metrics for evaluating the efficacy of structured reporting include: number of keystrokes (each use of computer keyboard), number of checklist clicks (each interaction with the checklist), checklist accepted suggestions (number of checklist suggestions included in the final document), contextual accepted suggestions (number of contextual suggestions included in the final document), keystrokes saved, time saved, and total time spent producing the document. Our findings demonstrate that structured reporting significantly reduces keystrokes and accelerates report generation, with an average time saving of 30% compared to conventional keyboard use. Furthermore, 84% of the checklist suggestions were accepted, improving report standardisation and reducing errors.
Conclusion: Structured reporting offers a promising approach to enhance radiologists' reporting efficiency. By utilising predefined templates and triggers, radiologists can create reports more rapidly while ensuring a higher level of standardisation. Clinicians benefit from clearer, more consistent reports, which can lead to better patient care. This study underscores the potential for structured reporting to bring significant advancements in radiology practices, establishing a new benchmark for efficiency and standardisation.
Limitations: No limitations were identified.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: No information provided by the submitter.
7 min
Leveraging GPT-4 for structured radiology reporting: a multilingual proof-of-concept study
Felix Busch, Berlin / Germany
Author Block: F. Busch1, L. C. Adams2, D. Truhn3, A. Kader1, S. Niehues1, M. Makowski2, K. K. Bressem1; 1Berlin/DE, 2Munich/DE, 3Aachen/DE
Purpose: The purpose of this study was to examine the feasibility of automated post-hoc transformation of free-text radiology reports into structured templates using Generative Pre-trained Transformer 4 (GPT-4), a natural language processing model by OpenAI, to standardise reporting language across institutions and enhance data extraction.
Methods or Background: 170 fictional English CT and MRI free-text radiology reports of various body regions and examinations (e.g. MRI of the brain, spine, joints, heart, whole body, and prostate, and CT of the head, chest, spine, thorax, abdomen, and pelvis) were generated by two board-certified radiologists. 23 structured templates were created based on previously published templates and the RadReport Template Library. GPT-4's performance was evaluated based on the accuracy and consistency of the generated structured reports. In addition, GPT-4's performance in chest radiography classification was tested against the medBERT.de German medical language benchmark on 583 German chest radiography reports. All code, JSON report templates, and CT and MRI report texts were made openly available at: https://github.com/kbressem/gpt4-structured-reporting. The web demo application can be accessed at: kbressem.pythonanywhere.com.
Results or Findings: GPT-4 converted all 170 free-text reports into valid JSON files for automatic reading. The model identified all radiology report key findings without any errors or omissions and consistently chose the correct report template based on the free-text report content. In the medBERT.de chest radiography benchmark, GPT-4 surpassed the existing leading model by detecting three pathological findings (congestion, opacity, pneumothorax) and one therapeutic device category (venous catheter).
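Illustrative sketch (not taken from the study's repository): one way to check that a model's output is valid JSON and conforms to a report template before it is read automatically. The template fields, allowed values and the function name `validate_structured_report` are assumptions made for this example only.

```python
import json

# Hypothetical template: field name -> allowed values (None means free text).
# These fields are illustrative and are not the study's JSON templates.
TEMPLATE = {
    "modality": ("CT", "MRI"),
    "body_region": None,
    "pneumothorax": ("present", "absent", "not mentioned"),
}


def validate_structured_report(raw_output: str, template: dict) -> list[str]:
    """Return a list of problems; an empty list means the output conforms."""
    try:
        report = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    # Every template field must be present and hold an allowed value.
    for field, allowed in template.items():
        if field not in report:
            problems.append(f"missing field: {field}")
        elif allowed is not None and report[field] not in allowed:
            problems.append(f"unexpected value for {field}: {report[field]!r}")
    # Fields outside the template are flagged as possible hallucinations.
    for field in report:
        if field not in template:
            problems.append(f"field not in template: {field}")
    return problems
```

A check of this kind only confirms structural validity; whether the extracted findings are correct still requires comparison against the free-text report.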
Conclusion: This proof-of-concept study demonstrates the potential of GPT-4 in post-hoc structured radiology report text transformation, offering a cost-effective and scalable solution for medical database organisation.
Limitations: Restricted access to GPT-4 requires potentially sensitive data to be shared with third parties. GPT-4 is not freely available but is comparatively inexpensive at about $0.10 per report.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Ethics approval was not required as the study did not involve patient data.
7 min
Automated anonymisation of radiology reports: comparison of publicly available natural language processing and large language models for HIPAA-compliant data use
Marcel Christian Langenbach, Cologne / Germany
Author Block: M. C. Langenbach1, B. Foldyna2, I. L. Langenbach2, V. Raghu2, T. Neilan2, I. Hadzic2, M. T. Lu2, J. Heemelaar2; 1Cologne/DE, 2Boston, MA/US
Purpose: The purpose of this study was to leverage publicly available offline natural language processing (NLP) methods and a large language model (LLM) to automatically remove PHI from free-text radiology reports to allow for secondary data use compliant with HIPAA regulations.
Methods or Background: We compared two publicly available rule-based NLP models (Google's spaCy; NLPac, accuracy-optimised; NLPsp, speed-optimised; iteratively improved on a test set of 400 randomly selected free-text radiology chest CT reports) and one offline LLM-model (Llama-2, Meta-AI) for PHI-anonymisation. The three models were evaluated on a test set of 100 new randomly selected chest CT reports. Precision, recall, and F1-scores were calculated. Two investigators adjudicated anonymisation performance based on three PHI entities (dates, medical record number (MRN), and accession numbers (ACC)) and whether relevant data was deleted.
Results or Findings: NLPac and NLPsp successfully removed all instances of highly sensitive PHIs (dates (n=333), MRNs (n=6), ACCs (n=92)) from the test set. The LLM-model removed all MRNs, 96% of ACCs, and only 32% of dates. NLPac was the most consistent model, with a perfect F1-score of 1.00 for MRN, ACC, and dates, followed by NLPsp, which had lower precision (0.86) and F1-score (0.92) for dates with non-dates classified as dates in 54 instances (28 cases). The LLM-model had perfect precision for all PHIs but the lowest recall of 0.96 for ACC (missed 4 instances in 4 cases) and 0.52 for dates (missed 134/333 instances in 69 cases) (F1 scores 0.98 and 0.68, respectively). Importantly, NLPac and NLPsp did not remove relevant medical information, while the LLM-model removed relevant information in 10% (n=10).
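Worked example (a sketch, not the authors' evaluation code): the F1 scores quoted above follow from the reported precision and recall through the harmonic mean, and the NLPsp precision for dates follows from the reported instance counts. The function names are illustrative.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NLPsp on dates: all 333 date instances removed, 54 non-dates flagged as dates.
p, r = precision_recall(tp=333, fp=54, fn=0)
print(f"NLPsp dates: precision={p:.2f}, recall={r:.2f}")  # precision=0.86, recall=1.00

# F1 from the rounded precision/recall values given in the abstract.
print(f"NLPsp dates F1: {f1_score(0.86, 1.00):.2f}")  # 0.92
print(f"LLM ACC F1:     {f1_score(1.00, 0.96):.2f}")  # 0.98
print(f"LLM dates F1:   {f1_score(1.00, 0.52):.2f}")  # 0.68
```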
Conclusion: Pre-trained publicly available NLP models can effectively anonymise free-text radiology reports, while anonymisation with an LLM is more prone to remove non-PHI data.
Limitations: This was a pilot study involving only chest CTs.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the institutional review board (IRB no. 2023P002169) with a waiver of written informed consent.
7 min
A RAdiology Data EXtraction (RADEX) tool for fast and accurate information curation from free-text reports: case study on thyroid ultrasound examinations
Lewis James Howell, Leeds / United Kingdom
Author Block: L. J. Howell, A. Zarei, T. M. Wah, S. Karthik, H. H. L. Ng, J. McLaughlan; Leeds/UK
Purpose: Extracting information from 'free-text' radiology reports is important for service evaluation, audit, unbiased cohort selection, case retrieval, and translational research including labelling medical datasets for artificial intelligence analysis. While machine learning methods have potential for automating this task, reliance on large labelled datasets and specific computing requirements limits their usefulness. Methods using human-defined rules offer a practical alternative, enabling better utilisation of information-rich radiology reports.
Methods or Background: Our tool, RAdiology Data EXtraction (RADEX), leverages clinicians' domain expertise for information extraction. It uses regular expressions (regex) for efficient and flexible text pattern-matching, including wildcard and proximity searches, Boolean logic, and negation handling. This rule-based approach enables clinical users to define complex queries without specialised software knowledge, giving an easy-to-understand method which allows predictions to be reviewed and rules updated in response to changing requirements and terminology. This transparency is vital for building trust and ensuring regulatory compliance.
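Illustrative sketch of rule-based matching with wildcard, proximity and negation handling. This is not RADEX's query syntax or implementation; the negation cues, the word window and the function name `mentions_finding` are assumptions chosen for the example.

```python
import re

# Hypothetical negation cues; a real rule set would be defined by clinicians.
NEGATION = r"\b(?:no|not|without|negative for)\b"


def mentions_finding(report: str, term_pattern: str, window: int = 4) -> bool:
    """True if some sentence matches the term without a nearby preceding negation.

    A sentence counts as negated when a negation cue occurs at most `window`
    words before the term (a simple proximity search).
    """
    term = re.compile(term_pattern, re.IGNORECASE)
    negated = re.compile(
        NEGATION + r"(?:\W+\w+){0," + str(window) + r"}?\W+(?:" + term_pattern + r")",
        re.IGNORECASE,
    )
    # Split into sentences so a negation in one sentence does not affect the next.
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        if term.search(sentence) and not negated.search(sentence):
            return True
    return False


# Wildcard pattern: "nodule", "nodules", "nodular", ...
print(mentions_finding("Multiple nodules in the right lobe.", r"nodul\w*"))  # True
print(mentions_finding("No thyroid nodules seen.", r"nodul\w*"))  # False
```

Because the rule is a plain pattern, a reviewer can read it, see why a report was classified as it was, and adjust the cues or window when terminology changes.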
Results or Findings: RADEX was applied to neck and thyroid ultrasound reports performed between 2015 and 2019 across five different hospitals. Nineteen sonographic observations were classified, including presence and multiplicity of thyroid nodules, British Thyroid Association thyroid nodule grading(s), altered thyroid echotexture, thyroiditis, thyroidectomy, nodal abnormality, and parathyroid adenomata. On an expert-labelled dataset of 400 reports, RADEX achieved >90% accuracy in all classes. Processing >10,000 reports took less than 60 seconds on a standard laptop.
Conclusion: This free open-source tool provides a scalable approach to extracting structured data from free-text reports, prioritising usability and explainability. It leverages regex's powerful pattern-matching without requiring knowledge of its complex syntax, suiting research and audit tasks where free-text information is key to understanding, but manual review is time-consuming and expensive.
Limitations: The main limitation of the study is that generalisability to other datasets/languages was not evaluated.
Funding for this study: Funding was provided by the UK Research and Innovation (UKRI) Engineering and Physical Sciences Research Council (EPSRC).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: REC review is not required.
7 min
Large language models for structured reporting with speech recognition: a comparative feasibility study
Benedikt Kämpgen, Würzburg / Germany
Author Block: B. Kämpgen1, F. Jungmann2, D. Feiler3, I. Schmittel1, J. Stöckmann3, G. Arnhold2, P. Mildenberger2, C. Düber2, T. Jorg2; 1Würzburg/DE, 2Mainz/DE, 3Munich/DE
Purpose: Conventional structured reporting (SR) using a mouse and keyboard is too time-consuming for broad user acceptance. In 2023, a dialogue system which allows radiologists to use speech recognition to fill SR-templates instead was introduced (T Jorg et al., Insights Imaging DOI: https://doi.org/10.1186/s13244-023-01392-y). However, the effort of training this NLP-based system for additional SR-templates is high, e.g., modelling of concepts, synonyms, and implicit knowledge.
Methods or Background: We extended the dialogue system with a state-of-the-art causal Large Language Model (LLM), OpenAI GPT-4, with a suitable prompt asking to translate from text to an SR-template in JSON, and compared the performance of the original system with the extended one.
Results or Findings: The extended LLM dialogue system showed slightly lower F1 score / precision / recall compared with (Jorg et al. 2023) on the same evaluation dataset comprising 82 fictional (-0.18 / -0.29 / -0.21) and 50 real examples (-0.09 / -0.19 / +0.03) of urolithiasis CT reports, with LLM-based fictional (0.80 / 0.70 / 0.75) and real (0.81 / 0.77 / 0.86) versus original fictional (0.98 / 0.99 / 0.96) and real (0.90 / 0.96 / 0.83).
The LLM had difficulties with implicit information; therefore, an inconspicuous kidney did not automatically lead to the negation of pathologies such as obstructive uropathy. Also, the LLM would hallucinate "round" calculi, or assume "no calculi" in abnormal kidneys.
Conclusion: The LLM-based dialogue system requires substantially less effort of training for new templates by only requiring a suitable prompt and JSON representation, without substantial loss of quality. A challenge for its application is to control implicit knowledge and hallucinations.
Limitations: The study included only one closed-source LLM; beyond speech-to-structure, the LLM's generative capabilities to interact with users were not evaluated.
Funding for this study: Funding was provided by the Bundesministerium für Bildung und Forschung (BMBF), 2022-2025, grant agreement number: 16SV9045, project KIPA.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: The study did not require professional legal advice from the Institutional Review Board, or informed consent of patients, according to the state hospital law. All patient data were fully de-identified and retrospectively analysed.
7 min
Using large language models to improve quality and actionability of radiology reports
Kalyan Sivasailam, Bangalore / India
Author Block: N. Kumarasami, P. N., K. Sivasailam, B. Subramanian; Bangalore/IN
Purpose: The objective of this study was to provide radiologists with fine-tuned large language models (LLMs) to enhance the quality, clarity, and actionability of radiology reports, with specific focus on CT Abdomen reports. Current radiology reporting methods can lead to ambiguities or misdiagnoses, especially in a remote diagnostics/teleradiology set-up. The physician/surgeon is looking for a detailed qualitative and quantitative description of a finding based on his/her suspicions and the patient's symptoms in order to arrive at a narrower set of differential diagnoses, as well as the appropriate procedure(s) he/she may follow in case of surgical intervention. Our focus was on understanding the mechanics and technical architecture behind the integration of LLMs into radiology workflows to transform the findings of a pathology into a very detailed and actionable description that is useful and relevant for the referring physician/surgeon.
Methods or Background: The authors fine-tuned a foundational model and built a radiology-specific large language model, focused on CT Abdomen, using real-life reports and templates. Initially, the LLM was fine-tuned with a data set comprising 4,500 question-answer pairs curated by the authors using instruction fine-tuning methodology. Subsequently, a retrieval-augmented generation method was employed, refining the models with 120,000 real-world reports. In the practical set-up, radiologists interact with a chatbot-like interface and input the pathologies. Using patient history, an initial draft report materialises using the LLM. Radiologists continue responding to the chatbot, culminating in a comprehensive report encompassing differential diagnoses.
Results or Findings: The LLMs were deployed in a remote diagnostics setup at 5C Network, India. Productivity went up by 270%. Queries from referring physicians dropped by 76%.
Conclusion: Incorporating LLMs into radiology workflows significantly enhances report clarity and accuracy, offering a promising avenue for optimised patient care and streamlined diagnostic processes.
Limitations: The set-up relies on radiologists identifying the primary pathologies correctly.
Funding for this study: Funding was received from 5C Network Private Limited, India.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: No information provided by the submitter.
7 min
A natural language processing pipeline to extract relevant information from mammography reports
Nikola Cihoric, Bern / Switzerland
Author Block: N. Cihoric, D. Reichenpfader, K. Nairz, R. Gaio, P. Rösslhuemer, G. Cereghetti, H. Bonel, H. Von Tengg-Kobligk, K. Denecke; Bern/CH
Purpose: Although mammography reporting is highly standardised, it results in mostly unstructured reports that are difficult to process automatically. Our aim is to extract relevant information from mammography reports and make it available in a structured format.
Methods or Background: We established a framework for definition and extraction of facts from the unstructured radiology reports adopting rules and specifications from the German version of BIRADS Atlas. We defined an annotation schema that ensures identification of relevant phrases in a report and subsequent information extraction at a high quality through an iterative and counter check approach. This manual annotation is supported by an automated pre-annotation to simplify handling of common phrases. The identified phrases were mapped to a standard terminology based on common data elements (CDEs) to fill a structured form with extracted information.
Results or Findings: BERT-based large language models were then pre-trained and fine-tuned with annotations from 210 mammography reports. Thereby we also generated an LLM based on 100,000 reports in German retrieved from our hospital. An in-depth analysis will be presented.
Conclusion: Our annotation approach separates extraction of information from the template filling, which reduces model complexity and permits independent improvement of both tasks. The implemented pipeline is generalisable and will allow us to structure other types of radiology reports as well. The structured information can be used for follow-up tasks such as decision support, quality assessment or outcome prediction.
Limitations: Our large language model is almost exclusively based on German language and it is trained on texts originating from a single hospital.
Funding for this study: Funding was received from the Innosuisse Project "Smaragd", 59228.1.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Kantonale Ethikkommission Bern approved this study.
7 min
Automatic structuring of radiology reports with on-premise open-source large language models
Piotr Woznicki, Warszawa / Poland
Author Block: P. Woznicki1, C. Laqua1, I. Fiku1, A. Hekalo1, T. Akıncı D'Antonoli2, D. Pinto Dos Santos3, B. Baeßler1, F. C. Laqua1; 1Würzburg/DE, 2Basel/CH, 3Cologne/DE
Purpose: Large language models (LLMs) have successfully been used to extract structured elements from plain text. However, data protection regulations restrict the use of commercial LLMs on patient data. This study evaluated state-of-the-art, on-premise LLMs for automatically structuring free-text radiology reports.
Methods or Background: We applied a novel approach to controlling the LLM output, ensuring the validity of nested structured reports produced by a locally hosted Llama-2 model. We compiled a data set of chest radiographs (CXR) including 200 English reports from a publicly available MIMIC-CXR data set and 200 de-identified German reports from a university hospital. A detailed, nested reporting template, containing 61 fields, was prepared. Ground-truth reports were annotated by a consensus of radiologists. The LLM was compared to two human readers (a junior resident in cardiology and a radiographer). Bayesian inference (Markov Chain Monte Carlo sampling) was used to calculate the Matthews correlation coefficient (MCC) from contingency tables, setting (-0.05;0.05) as the region of practical equivalence (ROPE).
Results or Findings: The average MCC of the LLM was 0.87 (94% HDI: 0.83; 0.90) for English and 0.67 (0.60; 0.73) for German reports. MCC differences were all overlapping ROPE for English: LLM-Human1 0.012 (-0.037; 0.061), LLM-Human2 -0.002 (-0.05; 0.05), Human1-Human2 -0.01 (-0.07; 0.04), and German reports: LLM-Human1 0.001 (-0.08; 0.08), LLM-Human2 -0.065 (-0.157; 0.027), Human1-Human2 -0.066 (-0.157; 0.026).
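Illustrative sketch: a point estimate of the MCC from a 2x2 contingency table, and a check of whether an interval overlaps the ROPE of (-0.05; 0.05). This does not reproduce the study's Bayesian MCMC estimation; the function names are assumptions, and the intervals in the usage lines are the ones reported above.

```python
import math


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Point-estimate MCC from the four cells of a 2x2 contingency table."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column total is zero
    return (tp * tn - fp * fn) / denom


def overlaps_rope(low: float, high: float, rope: tuple[float, float] = (-0.05, 0.05)) -> bool:
    """True if the interval [low, high] shares any point with the ROPE."""
    return low <= rope[1] and high >= rope[0]


print(matthews_corrcoef(tp=40, fp=10, fn=10, tn=40))  # 0.6

# Reported HDIs of the MCC differences (LLM-Human1, English and LLM-Human2, German).
print(overlaps_rope(-0.037, 0.061))  # True
print(overlaps_rope(-0.157, 0.027))  # True
```

Overlap with the ROPE means a practically negligible difference cannot be excluded; it is a weaker statement than the whole interval lying inside the ROPE.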
Conclusion: Post-hoc structuring of English CXR reports using local, open-source LLMs is feasible and on par with human readers. However, German reports were more challenging for the model. The understanding of semantics showed variability across specialties and languages.
Limitations: The study's small sample size as well as the fact that some reports lacked information on certain findings and were inconclusive or ambiguous were identified as limitations.
Funding for this study: This work was funded by the German Federal Ministry of Education and Research (Project: SWAG, 01KD2215A).
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study was approved by an ethics committee (nr: 20221004 02). The need for individual informed consent was waived.
7 min
Integrating AI results into standardised structured radiology reports: feasibility and implementation
Cyril Thouly, Sion / Switzerland
Author Block: C. Thouly1, B. Dufour1, B. Rizk2, D. Goyard3, P. Petetin4, H. Brat1, F. Zanca1; 1Sion/CH, 2Villars-sur-Glane/CH, 3Paris/FR, 4Berre l'Etang/FR
Purpose: One of the main challenges radiology currently faces is the integration of AI results into clinical workflow. Healthcare professionals navigate multiple systems and interfaces (PACS, RIS, AI report), with frequently inefficient workflows. We aimed to demonstrate the feasibility and effectiveness of integrating AI-derived results into standardised structured reports (SSR) for radiology, enhancing clinical workflow and reporting accuracy.
Methods or Background: A collaboration was initiated among a RIS provider, an AI platform provider, and our R&D department within a multicentric radiology network. The structured AI results were sent to the RIS via HL7 ORU messages (TCP protocol) and one message was generated per analysis. Each element of the AI structured result was placed in an OBX segment of the HL7 message. We used PatientID and AccessionNumber to link images on the PACS and the radiology report in the RIS. Segments were subsequently incorporated into SSR using a beacon in the RIS, undergoing multiple iterations for layout, wording, and punctuation accuracy. The percentage of AI pre-populated fields of SSR was estimated.
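Illustrative sketch of the message layout described above: one HL7 v2 ORU message per analysis, with each AI result element in its own OBX segment. The application names, the placement of the accession number in OBR-3, the observation codes and the function name are assumptions for this example; the actual field mapping used between the AI platform and the RIS is not given in the abstract.

```python
from datetime import datetime


def build_oru_message(patient_id, accession_number, results, sent_at=None):
    """Build a minimal HL7 v2 ORU^R01 message as a string.

    results: list of (code, label, value_type, value, units) tuples,
    one tuple per AI result element. Values are not escaped here, so they
    must not contain HL7 delimiter characters (| ^ ~ \\ &).
    """
    ts = (sent_at or datetime.now()).strftime("%Y%m%d%H%M%S")
    segments = [
        # MSH: delimiters, sender, receiver, timestamp, message type, control ID, version
        f"MSH|^~\\&|AI_PLATFORM|SITE|RIS|SITE|{ts}||ORU^R01|{accession_number}-{ts}|P|2.5",
        # PID-3 carries the patient identifier used for linking.
        f"PID|1||{patient_id}",
        # Accession number placed in OBR-3 (an assumption; sites differ).
        f"OBR|1||{accession_number}",
    ]
    for i, (code, label, value_type, value, units) in enumerate(results, start=1):
        # OBX-11 = "F" marks the observation as final.
        segments.append(f"OBX|{i}|{value_type}|{code}^{label}||{value}|{units}|||||F")
    # HL7 v2 segments are terminated by a carriage return.
    return "\r".join(segments) + "\r"


message = build_oru_message(
    "P123",
    "ACC456",
    [("BONEAGE", "Bone age", "NM", "10.5", "years")],  # hypothetical result element
)
print(message.replace("\r", "\n"))
```

On the receiving side, the RIS can map each OBX identifier to a field of the report template, which is how individual subsections end up pre-populated.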
Results or Findings: AI results were promptly transmitted to the RIS as HL7 messages. On accessing the report in the RIS, radiologists encountered prepopulated SSR subsections. To date, 40 bone age and 140 knee MRI SSR templates have been successfully integrated into clinical workflows. For both bone age and knee MRI, 60% of the report was pre-populated.
Conclusion: Seamless integration of AI results into SSRs is achievable during routine clinical workflows. The active involvement of radiologists ensures that resultant prepopulated reports align with their requirements.
Limitations: The success of this integration hinges on AI vendors delivering structured and standardised results. Inaccurate AI results present potential liability concerns for radiologists due to the risk of transmitting unchecked erroneous reports.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: No information provided by the submitter.
7 min
Structured reporting for efficient epidemiological and in-hospital prevalence analyses of pulmonary embolism
Tobias Jorg, Mainz / Germany
Author Block: T. Jorg, M. C. Halfmann, D. Graafen, C. Düber, P. Mildenberger, L. Müller; Mainz/DE
Purpose: Structured Reporting (SR) not only offers advantages regarding the report quality but, as an IT-based method, also the opportunity to aggregate and analyse large, highly structured data sets (data-mining). In this study, a data-mining algorithm was used to calculate epidemiological data and in-hospital prevalence statistics of pulmonary embolism (PE) by analysing structured CT reports.
Methods or Background: All structured reports for PE CT scans from the last 5 years (n = 2790) were extracted from the SR database and analysed. The prevalence of PE was calculated for the entire cohort and stratified by referral type and clinical referrer. Distributions of the localisations of PEs (central, lobar, segmental, subsegmental, left-sided, right-sided, bilateral) were calculated, and the occurrence of right heart strain was correlated with the localisations.
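Illustrative sketch of the stratified prevalence calculation on structured report records. The record layout (`pe_present`, `referral_type`) and the toy data are assumptions for this example and do not reflect the schema of the SR database used in the study.

```python
from collections import defaultdict


def prevalence_by(reports, key):
    """Prevalence of PE per group, where groups are the distinct values of `key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positive, total]
    for report in reports:
        group = counts[report[key]]
        group[0] += bool(report["pe_present"])
        group[1] += 1
    return {name: positive / total for name, (positive, total) in counts.items()}


# Toy records standing in for structured CT reports (not study data).
reports = [
    {"pe_present": True, "referral_type": "outpatient"},
    {"pe_present": False, "referral_type": "outpatient"},
    {"pe_present": False, "referral_type": "outpatient"},
    {"pe_present": False, "referral_type": "outpatient"},
    {"pe_present": True, "referral_type": "intensive care unit"},
    {"pe_present": False, "referral_type": "intensive care unit"},
]

overall = sum(r["pe_present"] for r in reports) / len(reports)
print(f"overall: {overall:.2f}")  # overall: 0.33
print(prevalence_by(reports, "referral_type"))
# {'outpatient': 0.25, 'intensive care unit': 0.5}
```

Because every report stores the finding in the same field, the same function can stratify by any other structured attribute, such as the clinical referrer, without text parsing.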
Results or Findings: The prevalence of PE in the entire cohort was 24% (n = 678). The median age of PE patients was 71 years (IQR 58 – 80). The sex distribution was 1.2/1 (M/F). Outpatients showed a lower prevalence of 23% compared to patients from regular wards (27%) and intensive care unit (30%). Surgically referred patients had a higher prevalence than patients from internal medicine (34% vs 22%). Patients with central and bilateral PEs had a significantly higher occurrence of right heart strain compared to patients with peripheral and unilateral embolisms.
Conclusion: Data-mining of structured reports is a simple method by which to obtain prevalence statistics, epidemiological data, and the distribution of disease characteristics, as demonstrated for the use case of PE. The generated data can be helpful for multiple purposes, such as for internal clinical quality assurance or scientific analyses. To benefit from these, consistent use of SR is required and therefore recommended.
Limitations: The study is limited by its single-centre design.
Funding for this study: This study received no outside funding.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: This study did not require professional legal advice from the Institutional Review Board, or the informed consent of patients, according to state hospital law. All patient data were fully de-identified.
7 min
Extracting information from unstructured MRI reports with a local open-source GPT model
Bastien Le Guellec, Lille / France
Author Block: B. Le Guellec, A. Lefevre, C. Bruge, L. Hacein-Bey, J-P. Pruvo, G. Kuchcinski; Lille/FR
Purpose: We set out to use a local open-source GPT model to automate information extraction tasks from unstructured MRI reports. We calculated its performance on reports from emergency brain MRIs performed for patients with headaches.
Methods or Background: All consecutive radiological reports from a French quaternary centre in 2022 were retrospectively reviewed. Two radiologists identified MRIs that were done for headaches. Four radiologists scored reports' conclusions as normal or abnormal. Abnormalities were labelled as either headache-generating or incidental. In parallel, Vicuna, an open-source GPT large language model, performed the same tasks. Vicuna's performance was evaluated using the radiologists' consensus as the gold standard.
Results or Findings: A total of 2398 reports were identified, of which 595 included headache in their indication. Median patient age was 35; 68% were female. The overall rate of causal findings in outpatients with headache was 23% (135/595). Our GPT-based method had an accuracy of >95% for simple information extraction tasks such as indication of the exam, patient sex and age, use of contrast medium injection and study categorisation as normal or abnormal. Vicuna's accuracy was 82% for the most complex task of causality inference between an abnormal MRI finding and symptoms.
Conclusion: We found that an open-source GPT model can extract information from radiological reports with excellent accuracy without further training. We hypothesise that this method could also be applied to any information extraction task relying on unstructured medical records.
Limitations: Due to the monocentric design of our study, we could not test for variability in reporting styles or language. Further studies will be needed to explore the adaptability of the proposed framework, even though it is expected to be high based on the ability of generative language models to handle various languages seamlessly.
Funding for this study: No specific funding was received for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study was approved by the IRB of Lille University Hospital.

This session will not be streamed, nor will it be available on-demand!