Research Presentation Session: Artificial Intelligence and Imaging Informatics

RPS 1805 - Beyond text mining: large language models as diagnostic and prognostic tools in radiology

March 7, 09:30 - 11:00 CET

6 min
When AI Joins the Table: Evaluating Large Language Model Performance in Sarcoma Tumor Board Decisions
Reza Dehdab, Nehren / Germany
Author Block: R. Dehdab, F. K. E. Mankertz, N. Maalouf, S. Afat, C. Deinzer; Tübingen/DE
Purpose: Multidisciplinary tumor boards (MDTs) are critical for the personalized management of soft tissue sarcomas (STS), but they are limited by time, cost, and resource demands. With recent advances in large language models (LLMs) like ChatGPT, there is growing interest in evaluating their potential role in augmenting MDT workflows. This study aimed to assess the clinical performance of ChatGPT-4o in real-world STS cases using predefined evaluation criteria, comparing its treatment suggestions with expert MDT decisions.
Methods or Background: We retrospectively analyzed 152 sarcoma cases presented to a single-center MDT between July 2023 and April 2024; 13 cases were used for prompt development and excluded. ChatGPT-4o generated guideline-based treatment suggestions from anonymized tumor board registration letters. Two blinded experts independently scored outputs using a five-domain framework: diagnostics, therapeutics, sequencing, chemotherapy, and contextualization. Scores were normalized to 1.0. Descriptive statistics and non-parametric ANOVA with post hoc tests assessed performance, including subgroup analysis by sarcoma subtype.
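As an illustrative sketch (not the study's actual scoring code), normalizing five-domain rubric scores to a 0–1 scale could look like the following; the per-domain maximum is an assumption:

```python
# Hypothetical sketch of the five-domain score normalization described above.
# Domain names follow the abstract; the per-domain maximum is an assumption.
DOMAINS = ["diagnostics", "therapeutics", "sequencing", "chemotherapy", "contextualization"]

def normalized_score(scores: dict, max_per_domain: int = 2) -> float:
    """Sum the domain scores and divide by the maximum achievable total."""
    total = sum(scores[d] for d in DOMAINS)
    return round(total / (max_per_domain * len(DOMAINS)), 3)
```

A case scoring the maximum in every domain normalizes to 1.0; partial credit in two domains yields a value comparable to the reported median of 0.857.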
Results or Findings: The final cohort included 138 sarcoma cases (median age: 66; 51% male). The most common subtypes were leiomyosarcoma (n=22), dedifferentiated liposarcoma (n=16), and myxofibrosarcoma (n=15). The median normalized score was 0.857 (IQR: 0.75–1.0), significantly below the maximum achievable score (p < 0.05). Clinical contextualization scored highest (p < 0.05 vs. other criteria). No significant performance differences were observed across sarcoma subtypes (p = 0.138).
Conclusion: ChatGPT-4o showed high but imperfect concordance with sarcoma tumor board decisions, performing best in individualized reasoning. While overall appropriate, gaps in sequencing and chemotherapy selection highlight the need for further refinement before clinical use.
Limitations: Limitations include prompt development on a limited internal case set, reliance on internal expert consensus without external validation, and the absence of testing across different model temperature settings.
Funding for this study: No funding was received for the conduct of this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethical approval was granted by the Institutional Review Board 418/2024BO2
6 min
LLMs Software Performance Evaluation for Prostate mpMRI Reports Interpretation
Benedetta Masci, Rome / Italy
Author Block: B. Masci1, L. Nardoni1, M. Polici1, M. Zerunian1, G. Argento1, D. Caruso1, M. Francone1, A. Laghi2; 1Rome/IT, 2Milan/IT
Purpose: To evaluate a large language model (LLM)-based software for simplifying prostate multiparametric MRI (mpMRI) reports, focusing on clinical accuracy and communication clarity. Special attention was given to correct identification of key diagnostic elements, including PI-RADS category and appropriate diagnostic work-up.
Methods or Background: This prospective single-center study analyzed 40 prostate mpMRI reports (January–May 2025) using a custom LLM-based software designed to process clinical data quickly and improve patient understanding. Reports were created by multiple board-certified radiologists to reduce bias and then simplified by the LLM tool.
Two radiologists (Reader 1, R1; Reader 2, R2) independently evaluated each AI-generated report in six domains: clarity, completeness, capacity to identify patient-relevant information, accuracy, communicative safety, and overall satisfaction, each on a 5-point Likert scale. Interpretative errors and correct identification of the PI-RADS-based diagnostic pathway were also assessed. Statistical analysis included the Wilcoxon signed-rank test, and inter-reader agreement was assessed.
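The Wilcoxon signed-rank statistic used for such paired reader ratings can be sketched in plain Python (a minimal illustration of the test statistic, not the study's analysis code):

```python
def wilcoxon_w(x, y):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired ratings.
    Zero differences are discarded; tied |differences| get averaged ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    abs_sorted = sorted(abs(d) for d in diffs)
    # assign the average rank to each distinct |difference|
    rank_of = {}
    i = 0
    while i < len(abs_sorted):
        j = i
        while j < len(abs_sorted) and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1..j
        i = j
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return min(w_plus, w_minus)
```

In practice a library routine (e.g. `scipy.stats.wilcoxon`) would also supply the p-value; the sketch only shows where the statistic comes from.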
Results or Findings: Both radiologists evaluated all reports across the domains. Language clarity of LLM-generated reports was moderate (R1: 3.35±0.98; R2: 3.08±0.80; p<0.05). Completeness showed greater variability (R1: 3.85±1.14; R2: 3.58±0.87; p<0.05). Capacity to identify patient-relevant information was rated 3.25±0.81 (R1) and 3.13±0.65 (R2; p=0.1970). Clinical accuracy scored 3.73±1.09 (R1) vs 3.45±0.82 (R2; p<0.05). Communicative safety was 3.58±0.98 for R1 and 3.23±0.83 for R2 (p<0.05); finally, overall satisfaction was moderate for both readers, with mean scores of 3.43±0.93 (R1) and 3.08±0.76 (R2; p<0.05). Inter-reader agreement was moderate-to-good (ICC 0.65–0.78). Errors occurred in 35% of reports, mostly minor; 77.5% preserved correct PI-RADS-based diagnostic pathways. These findings highlight the potential to enhance patient communication but support maintaining radiologist oversight.
Conclusion: A tailored LLM-based software showed encouraging performances in simplifying prostate mpMRI reports, with moderate inter-reader agreement. This tool may improve communication and patient empowerment in complex diagnostic settings such as prostate mpMRI, helping reduce communication gaps in imaging workflows.
Limitations: Single software evaluation; no patient assessment of the AI-generated reports.
Funding for this study: No funding was received for this study
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ethics committee approved
6 min
Democratizing Radiomics: Comparing Expert, “Vibe Coding” (LLM-Generated), and AI-Assisted Pipelines for Clinical Prediction
Pietro Paolo Azzaro, Rome / Italy
Author Block: P. P. Azzaro, G. Avesani, R. Chianura, M. Dolciami, B. Gui, E. Sala; Rome/IT
Purpose: Evaluate the performance and usability of three approaches for building radiomics machine-learning (ML) pipelines: an expert-crafted pipeline, a semi-automated AutoML tool (CLIMB), and a pipeline generated by a large language model (ChatGPT-o3).
Methods or Background: Radiomic and clinical data from 94 patients with ovarian cancer (1,702 features from the primary mass and metastatic sites) were used to predict BRCA status. The expert pipeline, implemented in Python with PyRadiomics and scikit-learn, employed SHAP-based feature selection and grid-search hyperparameter tuning. The LLM pipeline was created via iterative prompting of ChatGPT-o3 by a radiologist with good radiomics/ML knowledge but limited coding experience. CLIMB was attempted but could not execute because the feature matrix exceeded token capacity. Both working pipelines evaluated four classifiers—Random Forest, Support Vector Machine, Logistic Regression, and XGBoost—within a SMOTE-based workflow to address class imbalance. Five-fold cross-validation generated fold-wise accuracy and precision.
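The SMOTE step in the workflow above synthesizes minority-class samples by interpolating toward a nearest neighbour; a minimal pure-Python sketch of that idea (not the imblearn implementation typically used in such pipelines):

```python
import random

def smote_like(minority, n_new, seed=0):
    """SMOTE-style oversampling sketch: each synthetic point is a random
    interpolation between a minority sample and its nearest minority neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # nearest neighbour by squared Euclidean distance (excluding the point itself)
        nn = min((p for p in minority if p is not a),
                 key=lambda p: sum((u - v) ** 2 for u, v in zip(a, p)))
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(u + lam * (v - u) for u, v in zip(a, nn)))
    return synthetic
```

Because each synthetic sample lies on a segment between two real minority samples, it never leaves the convex hull of the minority class, which is why SMOTE balances classes without inventing outliers.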
Results or Findings: Mean accuracy across classifiers was 0.76 ± 0.03 for the expert pipeline and 0.73 ± 0.04 for the ChatGPT pipeline; the difference was not statistically significant (p = 0.27). The best individual model in both pipelines was XGBoost (expert 0.78; ChatGPT 0.75). CLIMB yielded no executable model due to input-size limitations. Development effort differed: the expert pipeline required about four programmer-hours and ~220 lines of bespoke code, whereas the ChatGPT pipeline required roughly ten hours of dialog-driven corrections but only ~110 lines. The expert code offered maximal transparency and tunability; the ChatGPT code was functional yet less flexible and relied on the prompt history for reproducibility.
Conclusion: LLM assistance enabled a non-programmer to build a radiomics pipeline with accuracy comparable to an expert benchmark.
Limitations: Single-dataset design, non-deterministic LLM behavior, assessment of only one LLM and one AutoML tool, and no prospective clinical evaluation.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the ethics committee under code 3311
6 min
A major potential of LLMs: determining skeletal biological age in breast cancer to predict vertebral compression fractures
Chengxin Wan, Chongqing / China
Author Block: C. Wan, Z. Zhang, L. Kong, J. Hao, L. Fajin; Chongqing/CN
Purpose: To determine whether an unsupervised large language model (LLM)–inferred skeletal biological age gap (BAG: LLM bone age minus chronological age) from routine CT–based reports predicts incident vertebral compression fractures (VCF) in women with breast cancer and improves risk stratification beyond conventional metrics.
Methods or Background: We retrospectively included 528 consecutive, newly diagnosed, surgically treated breast cancer patients (2018–2024; 46–76 years). Baseline assessments comprised thoracolumbar CT (T12–L2), quantitative CT volumetric BMD, dual-energy X-ray absorptiometry T-scores, bone-metabolism labs and lifestyle data; patients with baseline VCF or bone metastasis were excluded. All variables were transcribed into a structured Chinese radiology/clinic report. A domain-adapted LLM performed prompt-based, label-free inference of skeletal bone age from the textual report (including CT-derived metrics); BAG (years) was computed. The primary endpoint was first low-energy VCF within 3 years. Cox and Fine–Gray models estimated hazard ratios (HR) per 1-year BAG, adjusting for age, body mass index and Hounsfield units. Discrimination was compared against Hounsfield units, DXA and FRAX using C-index, time-dependent AUC and net reclassification improvement (NRI).
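Given the reported per-year hazard ratio, the BAG computation and its implied relative hazard can be sketched as follows (an illustration of the log-linear effect assumed by the Cox model, not the fitted model itself):

```python
def bone_age_gap(llm_bone_age: float, chronological_age: float) -> float:
    """BAG = LLM-inferred skeletal age minus chronological age (years)."""
    return llm_bone_age - chronological_age

def relative_hazard(bag_years: float, hr_per_year: float = 1.06) -> float:
    """Relative VCF hazard vs. BAG = 0, assuming a log-linear effect;
    hr_per_year defaults to the per-year hazard ratio reported in the abstract."""
    return hr_per_year ** bag_years
```

For example, a patient whose report-inferred skeletal age exceeds her chronological age by 7 years would carry roughly a 1.5-fold relative hazard under this assumption.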
Results or Findings: Fifty-four VCFs occurred (3-year cumulative incidence 10.2%). Each 1-year increase in BAG was associated with 6% higher VCF risk (HR 1.06; 95% confidence interval 1.02–1.11; P=0.002). Adding BAG to a base model (age, body mass index, Hounsfield units) improved 3-year C-index from 0.66 to 0.73 (Δ0.07; P<0.001), increased time-AUC by 0.06 and yielded an NRI of 0.15. Results were consistent in competing-risk analysis, bootstrap validation and endocrine-therapy strata.
Conclusion: Unsupervised LLM-inferred skeletal BAG from routine CT reports independently predicts VCF in breast cancer and meaningfully enhances discrimination beyond density-based metrics, supporting targeted post-operative fracture prevention.
Limitations: Single-center, retrospective design; external validation pending; potential residual confounding.
Funding for this study: The First Affiliated Hospital of Chongqing Medical University 2025 Science and Technology Innovation Project
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Informed consent was waived after approval by the Medical Research Ethics Review Committee of the First Affiliated Hospital of Chongqing Medical University (No. 276-02).
6 min
Multidisciplinary Management of B3 Breast Lesions: A Comparative Performance Analysis of General and Custom-Trained LLM Models
Gianmarco Della Pepa, Milan / Italy
Author Block: G. Della Pepa, G. Irmici, M. Cao, C. De Berardinis, E. D'Ascoli, L. Corradini, G. Rossini, C. Depretto, G. P. Scaperrotta; Milan/IT
Purpose: To retrospectively compare GPT-4o, a general-purpose large language model (LLM), and a custom-trained GPT adapted to breast imaging practice, in supporting clinical decision-making for B3 breast lesions. The goal was to assess concordance with multidisciplinary team (MDT) decisions and evaluate clinical utility.
Methods or Background: Clinical, imaging, and histopathological data of consecutive biopsy-confirmed B3 breast lesions discussed at the institutional MDT between February and July 2024 were anonymized and standardized for text-only input.
Two LLMs were tested: GPT-4o and a custom GPT trained using retrospective institutional cases, internal MDT protocols, and international guidelines and consensus.
Each case was submitted to both models using a two-step prompt: generation of management options, followed by a single best recommendation with rationale. Three breast radiologists reviewed outputs and rated option accuracy and recommendation appropriateness (1–5 scale). Concordance with MDT decision was recorded.
Results or Findings: Forty-nine cases were included. Both models generated accurate management options (mean scores: GPT-4o 4.6/5; custom GPT 4.9/5). Appropriateness of the final recommendation was lower for GPT-4o (3.8/5) compared to the custom GPT (4.5/5). Concordance with MDT decisions was 65.3% (32/49) for GPT-4o and 83.7% (41/49) for the custom GPT. Weighted Kappa values were 0.41 and 0.68, respectively. McNemar’s test confirmed a significant difference in concordance (p=0.03); Wilcoxon signed-rank test confirmed the difference in appropriateness (p=0.002). Inter-reader agreement was substantial (Fleiss’ Kappa 0.72).
Conclusion: Both models showed high accuracy in understanding complex cases. While GPT-4o aligned only moderately with MDT decisions, the custom GPT demonstrated improved concordance and appropriateness. Carefully trained LLMs may provide valuable decision support in challenging scenarios, particularly when MDT expertise is not readily available.
Limitations: Single-institution pilot with a small sample size. Broader validation and training on larger, more diverse datasets are needed.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Approved by: Comitato Etico Territoriale Lombardia 4
6 min
Potential value of the commercialised large language models for cancer staging based on head and neck MRI reports
Qiyong Hemis Ai, Hong Kong / Hong Kong SAR China
Author Block: Q. H. Ai, H. H. Leung, H. M. Kwok, K. F. Hung, L. M. Wong, T. Y. So, A. D. King, K. T. Bae; Hong Kong/HK
Purpose: MRI is routinely used for staging head and neck cancer (HNC), which is a vital step for disease management. However, T and N category criteria for HNC staging are complex and require specialized expertise, and so in many institutions only descriptive MRI reports without formal staging are available. This study aimed to evaluate the potential of commercialized large language models (LLMs) in staging HNC based on the descriptive MRI reports for oral cavity cancer (OCC) by comparing the accuracies of LLMs for T and N categorization and overall stage with that of human experts.
Methods or Background: Descriptive content from 70 eligible MRI reports was retrospectively input to three commercialised LLMs (ChatGPT5.0, ChatGPT4.0 and DeepSeekV3). The T- and N-categorisation and overall stage were extracted from the outputs for further analysis. Accuracies of the LLMs for HNC staging were assessed against the gold standard (confirmed by two other senior head and neck radiologists) and compared using McNemar's test.
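McNemar's test for comparing paired classifier accuracies reduces to the counts of discordant cases; a minimal continuity-corrected sketch (illustrative, not the authors' statistical software):

```python
import math

def mcnemar_p(b: int, c: int) -> float:
    """McNemar's test (continuity-corrected) for paired accuracy comparisons.
    b, c = discordant counts (model A right / model B wrong, and vice versa)."""
    if b + c == 0:
        return 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # p-value for chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))
```

Only cases on which the two models disagree carry information; heavily one-sided discordance (e.g. 15 vs. 3) is significant, while balanced discordance is not.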
Results or Findings: The LLMs staged all cases based on the 8th edition of the AJCC cancer staging manual. ChatGPT5.0, ChatGPT4.0 and DeepSeekV3 achieved accuracies of 74.3%, 72.8%, and 57.1%, respectively, for T-categorisation; 85.7%, 82.8%, and 61.4%, respectively, for N-categorisation; and 75.7%, 72.8%, and 52.8%, respectively, for overall stage. Compared with DeepSeekV3, ChatGPT5.0 and ChatGPT4.0 showed higher accuracies for T- and N-categorisation and overall stage (all p<0.05). No significant differences in accuracies for T- and N-categorisation and overall stage were found between ChatGPT5.0 and ChatGPT4.0 (all p>0.05).
Conclusion: Results suggested that the current commercialised LLMs may not be able to assist HNC staging based on descriptive MRI reports for OCC.
Limitations: Small sample size; the LLMs were tested only on OCC MRI reports.
Funding for this study: Not applicable
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Joint CUHK-NTEC (Ref. 2025-567)
6 min
Turning Reports into Labels: LLM-Driven Extraction of Tumor Progression in Lung Cancer Patients using Radiological Reports
Sina Warmer, Essen / Germany
Author Block: S. Warmer, Y. Wen, C. Bojahr, L. Umutlu, J. Haubold, J. Kohnke, K. A. Borys, F. Nensa, R. Hosch; Essen/DE
Purpose: Radiology reports routinely describe tumor response, yet this information remains embedded in unstructured text, limiting its availability in structured form for clinical use. This study therefore investigates whether a large language model (LLM) can accurately classify tumor progression, regression, or stability from free-text radiology reports using a zero-shot approach.
Methods or Background: A total of 223 radiology reports were randomly selected from a retrospective cohort of 100 lung cancer patients (female=41, 65±9.72 years, NSCLC=96, SCLC=4) who underwent CT or PET/CT imaging between 2003 and 2021. The dataset consisted of thoracic CT scans (44%), PET/CT whole-body scans (34%), and abdominal CT scans (22%). Clinical experts independently annotated reports to indicate whether they showed progression, regression, or stability. A general-purpose, open-source LLM (Qwen3-235B) was prompted in a zero-shot setting to classify each report. Performance was evaluated for two tasks: binary classification (progression vs. no progression) and multiclass classification (progression, regression, stability, non-classifiable). Performance was measured using accuracy and F1 score.
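A zero-shot setup of this kind pairs a closed label set with a constrained prompt and a parser for the model's free-text answer; the wording below is a hypothetical sketch, not the study's actual prompt:

```python
# Hypothetical zero-shot prompt and answer parser (wording is an assumption,
# not the prompt used in the study).
LABELS = ["progression", "regression", "stability", "non-classifiable"]

def build_prompt(report_text: str) -> str:
    """Compose a zero-shot classification prompt over the closed label set."""
    return (
        "Classify the tumor response described in the following radiology "
        "report as exactly one of: " + ", ".join(LABELS) + ".\n"
        "Answer with the label only.\n\nReport:\n" + report_text
    )

def parse_label(model_output: str) -> str:
    """Map a free-text model answer back onto the closed label set."""
    lowered = model_output.strip().lower()
    for label in LABELS:
        if label in lowered:
            return label
    return "non-classifiable"
```

Constraining the answer format and re-parsing it keeps the evaluation deterministic even when the model adds surrounding prose.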
Results or Findings: In the binary classification task, the model achieved 80% accuracy with an F1 score of 0.71 and strong agreement with expert labels. In the multiclass setting, accuracy was 74%, with per-class F1 scores of 0.74 for progression, 0.90 for regression, and 0.71 for stability. The model achieved 70% accuracy on non-classifiable cases. These results were achieved without fine-tuning, relying solely on zero-shot prompting, which underscores the potential of open-source LLMs for clinical information extraction.
Conclusion: This study demonstrates the potential of zero-shot LLMs to extract structured tumor responses directly from free-text radiology reports. Such models offer a scalable, training-free solution to support longitudinal therapy monitoring and improve access to critical clinical information written in radiological reports.
Limitations: Further evaluation of different LLMs and prompting strategies is needed.
Funding for this study: No funding was provided for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the ethics committee of the University Hospital Essen.
6 min
Large Language Models in Medical Data Structuring: A Case Study on Neuro-Oncological Cohorts
Robert Hahnfeldt, Mönchengladbach / Germany
Author Block: R. Hahnfeldt, M. Schönfeld, T. Schömig, J. P. Janssen, S. Lennartz, D. Maintz, M. Schlamann, K. R. Laukamp, J. Kottlors; Cologne/DE
Purpose: Primary brain tumors form a heterogeneous group of neoplasms. Standardized follow-up protocols lead to growing institutional MRI datasets that reflect local epidemiology and offer a basis for clinical research. Large language models (LLMs) offer a novel approach by enabling automated structuring and real-time summarization of clinical cohorts. This study evaluated LLM use for automated analysis of a neuro-oncological database, including estimation of entity distributions, follow-up frequencies, and progression patterns.
Methods or Background: A total of 248 patients with intracranial neoplasms treated between 2014 and 2023 at a neuro-oncological center were included, with 165 glioblastoma patients. An anonymized institutional database was processed using Claude Sonnet 3.5 to autonomously generate descriptive statistics and cohort-specific estimates of follow-up intervals and progression rates. Model outputs were cross-validated using conventional statistical techniques.
Results or Findings: Patients underwent an average of 11 MRI examinations during their disease course. In glioblastoma cases, the shortest intervals between imaging and clinical progression were observed (mean: 348 days for imaging; 533 days for clinical symptoms). The LLM successfully extracted these patterns and predicted progression timelines that plausibly matched the results of manual statistical analysis.
Conclusion: LLM-based cohort analysis enables reliable, automated extraction of key metrics from routine clinical data. The high concordance with classical statistics underscores the potential of these models to support longitudinal data analysis, resource planning, and individualized follow-up strategies in neuro-oncology.
Limitations: Potential bias due to incomplete or inconsistent historical documentation; the use of Claude Sonnet 3.5 for statistical analyses is not yet established and validated.
Funding for this study: This study was conducted with project funding from the Else Kröner-Fresenius-Stiftung (Grant-Number: 2023_EKEA.77)
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Symptom-Only Localization of Brainstem Ischemia: LLM vs Neurologists in 109 DWI-Positive Cases
Nedim Beste, Cologne / Germany
Author Block: N. Beste1, T. Dratsch1, J. Kottlors1, P. Floßdorf1, A-M. Konitsioti1, L. Volz1, D. Pinto Dos Santos2, M. Schönfeld1; 1Köln/DE, 2Mainz/DE
Purpose: To evaluate the diagnostic accuracy of large language models (LLMs) in localizing brainstem ischemic lesions based solely on neurological symptoms, compared with experienced neurologists.
Methods or Background: We retrospectively included 109 patients with diffusion-weighted imaging (DWI)-confirmed acute brainstem ischemia. Clinical symptoms were provided to three neurologists and six LLMs (GPT-5, GPT-4, GPT-4.1, GPT-4o, o3, o3 pro), which were tasked to predict lesion site (midbrain, pons, medulla) and laterality (left/right). Accuracy, Cohen’s κ, region-specific performance, and correlations with symptom count were analyzed.
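Cohen’s κ, used here to quantify agreement with the DWI reference, corrects observed agreement for chance agreement; a minimal sketch:

```python
from collections import Counter

def cohens_kappa(pred, truth):
    """Cohen's kappa: observed agreement between predicted and true labels,
    corrected for the agreement expected by chance from the label marginals."""
    n = len(pred)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    pc, tc = Counter(pred), Counter(truth)
    expected = sum(pc[k] * tc[k] for k in set(pred) | set(truth)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Chance correction is what makes the reported κ = 0.29 more conservative than raw accuracy: a model that always guessed the most frequent region could score well on accuracy yet near zero on κ.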
Results or Findings: GPT-4 and GPT-4o achieved the highest overall accuracy (56.0%), outperforming GPT-5 (48.6%), GPT-4.1 (41.3%), o3 (34.9%), o3 pro (10.1%), and all neurologists (32.1–36.7%). Cohen’s κ was highest for GPT-4o (κ = 0.29). LLMs performed best in pontine strokes (GPT-4: 74.0%, GPT-4o: 68.8%), while performance in midbrain and medulla lesions was substantially lower. A weak but significant correlation between number of symptoms and prediction accuracy was found for GPT-4 (r = 0.28, p < 0.01), GPT-5 (r = 0.26, p < 0.01), and one neurologist (r = 0.29, p < 0.01).
Conclusion: GPT-4 and GPT-4o outperformed neurologists in localizing brainstem lesions based on clinical symptoms alone, while GPT-5 also exceeded human performance but remained less accurate than GPT-4/4o. Accuracy was modest overall, especially outside pontine strokes.
Limitations: Retrospective design, small cohort size, absence of multimodal input, high percentage of pontine strokes and lack of external validation limit generalizability. Prospective studies with integrated imaging and reasoning-augmented models are needed.
Funding for this study:
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Demographic bias in large language models for CT organ assignment: a multicentre diagnostic accuracy study
Mor Saban, Tel Aviv / Israel
Author Block: M. Saban1, Y. Alon1, O. Luxenburg2, C. Singer3, M. Hierath4, A. Karoussou-Schreiner5, B. Brkljačić6, J. Sosna2; 1Tel Aviv/IL, 2Jerusalem/IL, 3Ramat Gan/IL, 4Vienna/AT, 5Luxembourg/LU, 6Zagreb/HR
Purpose: To determine whether large language models (LLMs) exhibit sex- and age-related performance differences when recommending organs for CT and CT angiography (CTA) referrals, compared with clinicians and the ESR iGuide reference standard.
Methods or Background: In this retrospective multicentre diagnostic accuracy study, 5,308 referrals (4,396 CT, 912 CTA) from seven European countries (2022–2023) were analysed. Organs suggested by GPT-4 and Claude-3 Haiku were compared with ESR iGuide recommendations and independent radiologist assessments. Accuracy, precision, recall, F1 score and Cohen’s kappa were calculated with bootstrap 95% confidence intervals. Subgroup analyses contrasted male versus female and <65 versus ≥65 years. Differences between modalities and subgroups were assessed with permutation and χ² tests (significance p < 0.05).
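Bootstrap confidence intervals of the kind reported can be sketched over per-referral 0/1 correctness flags (a percentile-bootstrap illustration; the resample count is an assumption, not the study's software):

```python
import random

def bootstrap_ci(correct_flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap (1 - alpha) CI for accuracy over 0/1 correctness flags:
    resample the cases with replacement, recompute accuracy, take percentiles."""
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = sorted(
        sum(rng.choice(correct_flags) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling whole referrals (rather than assuming a parametric error model) is what makes the interval honest for subgroup comparisons of unequal size.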
Results or Findings: Clinicians demonstrated consistently high performance across all strata, achieving a kappa of 0.80 for CT and 0.72 for CTA, with no significant differences observed based on sex or age. Large language models (LLMs) displayed comparable performance for CT, with GPT-4 attaining a kappa of 0.68 and Claude-3 recording a kappa of 0.71. However, performance for CTA declined, resulting in kappa values of 0.56 for GPT-4 and 0.59 for Claude-3 (p < 0.001). Both models exhibited significant variations in accuracy related to sex: for CT, the accuracy was 6–8% higher for males, whereas for CTA, it was 4–6% higher for females (p < 0.01). Furthermore, LLMs favored younger patients in both modalities, with F1 scores being 5–7% higher in the population under 65 years (p < 0.05).
Conclusion: LLMs approach expert performance in CT organ assignment but display clinically relevant demographic biases and reduced robustness in CTA. Mitigation strategies and hybrid human-AI workflows are required before clinical deployment.
Limitations: The study’s retrospective design may limit generalisability.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the Tel Aviv University institutional review board (TAU-0306-CTAI-2024).
6 min
ChexFract: Fracture-Focused Vision–Language Models for Chest Radiography
Ekaterina Petrash, Moscow / Russia
Author Block: N. Nechaev, E. Przhezdzetskaya, D. Umerenkov, V. Gombolevskiy, E. Petrash, D. Dylov; Moscow/RU
Purpose: State-of-the-art vision–language models (VLMs) often miss or underspecify fractures—rare yet clinically critical findings on chest radiographs. We aimed to (i) build a fracture-focused dataset for report generation and (ii) train specialized VLMs that produce precise, structured fracture descriptions, improving clinical utility over general-purpose systems.
Methods or Background: We curated ChexFract, a public set of 18,710 CXR–text pairs with standardized, template-based fracture mentions. Fracture sentences were extracted from original reports and normalized to a schema capturing presence, location, side, stage (acute/healed/other), and implants. As the language backbone we used Phi-3.5 Vision Instruct. We compared two domain visual encoders—Rad-DINO (MAIRA-2) and CheXagent—by training lightweight projection heads and fine-tuning end-to-end for free-text generation. To score model outputs, we parsed generated text to the same schema and computed standard classification metrics against ground truth.
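Scoring by parsing generated text back onto a structured schema can be sketched with simple pattern matching; the field names and vocabulary below are assumptions for illustration, not the actual ChexFract schema:

```python
# Hypothetical sketch of mapping a generated fracture sentence onto a
# presence/location/side/stage schema (vocabulary is an assumption).
def parse_fracture(sentence: str) -> dict:
    s = sentence.lower()
    return {
        "presence": "no " not in s and "fracture" in s,
        "location": next((loc for loc in
                          ("rib", "clavicle", "shoulder", "spine", "sternum", "scapula")
                          if loc in s), None),
        "side": next((sd for sd in ("left", "right") if sd in s), None),
        "stage": "healed" if "healed" in s else ("acute" if "acute" in s else "other"),
    }
```

Parsing both the model output and the ground-truth report through the same extractor turns free-text report generation into a per-field classification problem, which is what makes the ROC-AUC and F1 comparisons well defined.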
Results or Findings: Fracture-specialized VLMs consistently outperformed general baselines. With the MAIRA-2 encoder, end-to-end fine-tuning achieved ROC-AUC 0.715 and F1 0.629; with the CheXagent encoder, fine-tuned models reached ROC-AUC 0.697 and F1 0.591. General baselines were lower (e.g., MAIRA-2 baseline ROC-AUC 0.518, F1 0.085; CheXagent baseline ROC-AUC 0.604, F1 0.376). Stratified analyses by fracture type and anatomical site (ribs, clavicle, shoulder, spine, sternum, scapula, sternal wires/other) showed complementary strengths across encoders and informed selection for trauma-focused reporting workflows.
Conclusion: ChexFract enables accurate, structured fracture reporting directly from chest radiographs by focusing learning on clinically salient trauma findings. Specialized VLMs trained on ChexFract markedly improve detection and description quality versus general models, with best configurations achieving ROC-AUC ≈0.71 and F1 ≈0.63.
Limitations: Relabeling used an LLM-assisted protocol and may inherit parsing/bias artifacts; external, multi-center validation is pending. Prospective workflow impact, safety, and hallucination controls require clinical studies.
Funding for this study: No external funding was received; the work was conducted as part of the authors’ institutional duties.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: