Research Presentation Session: Artificial Intelligence and Imaging Informatics

RPS 605 - Does AI really help? Real-world performance data in fracture diagnostics

March 4, 16:30 - 17:30 CET

6 min
Retrospective validation of three artificial intelligence-based fracture detection systems using local data
Juuso Heikki Jalmari Ketola, Helsinki / Finland
Author Block: J. H. J. Ketola, S. Inkinen, T. Mäkelä, K. Pohto, M. Kortesniemi, S. Syväranta; Helsinki/FI
Purpose: Artificial intelligence (AI)-based fracture detection systems are being implemented worldwide to accelerate diagnostic workflows and improve accuracy. In this study, we retrospectively evaluated three commercial AI fracture detection systems using a diverse local trauma X-ray dataset. The objective was to assess each system's performance together with its respective strengths and limitations.
Methods or Background: A local dataset comprising 1,891 trauma X-ray images (675 adult, 1,216 paediatric) spanning various anatomical regions was processed using three different AI systems. AI results were compared with the primary radiologist reports to calculate accuracy, sensitivity, and specificity for each solution. An experienced radiologist verified any discrepancies, and fractures identified by AI but not included in the original reports were classified as additional AI findings.
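The comparison described above reduces to confusion-matrix arithmetic once each AI result has been matched against the radiologist report. A minimal sketch; the counts passed in below are illustrative only, not the study's data:

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts,
    with the radiologist report as the reference standard."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall on fracture-positive exams
    specificity = tn / (tn + fp)   # recall on fracture-negative exams
    return accuracy, sensitivity, specificity

# Illustrative counts only (not the study's data):
acc, sens, spec = binary_metrics(tp=180, fp=12, tn=280, fn=20)
```

Because sensitivity is computed over fracture-positive exams and specificity over fracture-negative ones, a system at a fixed operating point can trade one for the other, which is why the abstract reports both per population.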
Results or Findings: For adults, accuracies ranged from 0.94 to 0.96, sensitivities from 0.89 to 0.93, and specificities from 0.96 to 0.97. In paediatric cases, accuracies ranged from 0.96 to 0.97, sensitivities from 0.89 to 0.95, and specificities from 0.96 to 0.98. AI-enhanced detection rates for adults ranged from 4.9% to 5.9%, and for paediatrics from 1.1% to 2.1%. Performance varied by anatomical area, bone, and fracture type.
Conclusion: All three solutions showed similar performance, differing only by a few percentage points. Depending on the clinical use case, prioritizing either higher sensitivity or higher specificity may be preferable. Our findings underscore the importance of validating AI fracture detection systems with local data to reveal their unique advantages and limitations.
Limitations: Testing was performed using fixed operating points. More comprehensive evaluation based on receiver-operating characteristics was not feasible due to unavailable raw output probabilities. A detailed cost-benefit assessment would require a longer, prospective testing period.
Funding for this study: No external funding
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Role of artificial intelligence in assisting non-MSK specialist radiologists with tibial plateau fracture detection
Ana Isabel Hernáiz Ferrer, Pavia / Italy
Author Block: A. I. Hernáiz Ferrer, E. M. Preda, D. Meccariello, J. Bosio, L. Carone, C. Bortolotto, L. Preda; Pavia/IT
Purpose: We evaluated the diagnostic performance of two AI software programs, BoneView (AI-1) and RBfracture (AI-2), in assisting two non-specialist radiologists (NSR-1 and NSR-2) with the detection of tibial plateau fractures on conventional knee X-rays. These fractures are often challenging to identify due to their subtle appearance.
Methods or Background: In this retrospective monocentric study we analyzed 673 radiographs from 324 patients with knee trauma. All patients included in the study underwent a knee CT scan after the X-ray, which served as the gold standard. Diagnostic performance was assessed using sensitivity, specificity and area under the curve (AUC).
Results or Findings: The average patient age was 62 years, and 52.5% were female. CT scans confirmed tibial plateau fractures in 145 patients (44.8%) with AO/OTA B2 fractures being the most common type (35 patients, 24.1%).
When evaluating X-rays, the AI tools performed similarly to the NSRs (AUC: AI-1=0.88, AI-2=0.86, NSR-1=0.86, NSR-2=0.78). AI significantly improved the diagnostic performance of NSR-2 when combined with AI-1 and AI-2 (AUC=0.85 and 0.86; p=0.0001 and p=0.001, respectively), and of NSR-1 when combined with AI-2 (AUC=0.90; p=0.018).
The overall performance of both AI systems for detecting other fracture types visible on radiographs (femur, patella, fibula) was also evaluated. AI-1 achieved an AUC of 0.842 and AI-2 an AUC of 0.798 (p=0.017). Since AI-2 also detects lipohemarthrosis, we assessed its performance, finding a sensitivity of 61.7%, specificity of 98.5%, and an AUC of 0.801.
Conclusion: Both AI tools can assist non-specialist radiologists in detecting tibial plateau fractures on standard X-rays. They can also detect other fractures, with AI-2 showing particular promise in improving specificity through the detection of lipohemarthrosis.
Limitations: This study has some limitations, including its retrospective design. The results presented are preliminary.
Funding for this study: This study received no financial support and was conducted using free trial versions of the AI software.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the IRB Comitato Etico Territoriale Lombardia 6.
6 min
Budget-Impact Analysis of AI-Supported Fracture Detection: A Multi-Center Study in Moravian–Silesian Hospitals
Daniel Kvak, Praha / Czechia
Author Block: M. Rehor1, J. Orság2, D. Kvak2; 1Jindřichův Hradec/CZ, 2Prague/CZ
Purpose: We quantified the one‑year budget impact and return on investment (ROI) of deploying an AI tool for fracture detection in emergency musculoskeletal radiography across five Moravian–Silesian hospitals.
Methods or Background: We built a decision-tree model comparing standard care (radiologist-only; sensitivity 82.4%, specificity 95.7%) with an AI-assisted pathway in which all 339,828 annual MSK X-rays were pre-read by commercial software (Carebot AI Bones; €1 per scan; sensitivity 92.1%, specificity 89.7%). AI-positive studies proceeded to radiologist confirmation, whereas AI-negative studies were cleared. First-diagnosis fracture prevalence was set at 6% (20,390 fractures). Downstream costs comprised an emergency revisit per false negative (€150) plus a one-day admission for 25% of false negatives (€194), and outpatient referrals for 10% of false positives (€20). Radiologist time savings were valued at one minute per avoided read at €27/hour. Litigation costs were analysed separately and excluded from the primary model.
Results or Findings: Relative to standard care, AI assistance reduced false negatives by 55% (3,587 to 1,614) and increased false positives by 140% (13,735 to 32,891). False‑negative costs fell from €711,954 to €320,379 (saving €391,575). False‑positive referrals increased from €27,470 to €65,782 (increment €38,312). Avoiding 288,161 reads saved 4,803 radiologist hours, valued at €129,681. After AI fees of €339,828, net regional savings were €143,116 (≈€28,623 per hospital). The benefit–cost ratio was 1.42, corresponding to an ROI of 42%. At €0.75 and €0.50 per scan, ROI rose to 90% and 184%, respectively.
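The headline figures follow from the stated model inputs. A minimal sketch of the decision-tree arithmetic, using only the rates and unit costs reported above; totals differ slightly from the published figures because of rounding of intermediate counts:

```python
# Budget-impact sketch from the stated inputs; intermediate rounding
# means totals differ slightly from the abstract's reported figures.
N = 339_828                          # annual MSK X-rays
prev = 0.06                          # first-diagnosis fracture prevalence
pos, neg = N * prev, N * (1 - prev)

sens_std, spec_std = 0.824, 0.957    # radiologist-only pathway
sens_ai, spec_ai = 0.921, 0.897      # AI-assisted pathway

fn_std, fn_ai = pos * (1 - sens_std), pos * (1 - sens_ai)
fp_std, fp_ai = neg * (1 - spec_std), neg * (1 - spec_ai)

fn_unit_cost = 150 + 0.25 * 194              # ED revisit + 25% one-day admission
fn_saving = (fn_std - fn_ai) * fn_unit_cost
fp_increment = (fp_ai - fp_std) * 0.10 * 20  # 10% outpatient referrals at €20

avoided_reads = neg * spec_ai + fn_ai        # AI-negative studies cleared
time_saving = avoided_reads / 60 * 27        # 1 min per avoided read at €27/h
ai_fee = N * 1.0                             # €1 per scan

net = fn_saving + time_saving - fp_increment - ai_fee   # ≈ €143k–144k
```

Note the structural assumption embedded in `avoided_reads`: every AI-negative study, including the AI's false negatives, skips the radiologist read, which is exactly the negative-clearing assumption flagged in the limitations.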
Conclusion: In high‑throughput emergency workflows, the AI triage model can halve missed fractures, lower downstream care costs, and deliver a positive ROI while focusing human review on AI‑positive studies.
Limitations: Assumptions include negative‑clearing without routine spot checks, static test performance, tariff‑based unit costs, and single‑region generalisability; prospective, real‑time validation is required.
Funding for this study: Carebot s.r.o. supported software integration and modelling; no public grant funding was received.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Comparison of deep learning object detection models for fracture diagnosis in X-ray images
Changmin Jeon, Seoul / Korea, Republic of
Author Block: C. Jeon, J. Park; Seoul/KR
Purpose: Automated fracture detection via deep learning-based object detection can enhance diagnostic efficiency from radiographs. While one-stage models offer real-time processing, two-stage models provide higher precision. This study compares YOLOv8, RetinaNet, and Faster R-CNN on the FracAtlas X-ray dataset to determine the model offering the optimal trade-off between accuracy and practical applicability for clinical use.
Methods or Background: Object detection models were trained on the FracAtlas dataset (4,083 radiographs, 922 annotated fractures) using Python 3.9.0 with the Detectron2 and Ultralytics frameworks. Model performance was assessed via confusion matrix-based indicators (Precision, Recall, F1-score), mean Average Precision at IoU 0.50 (mAP@0.50), and inference speed (frames per second, FPS).
Results or Findings: Faster R-CNN achieved the highest diagnostic accuracy with mAP@0.50 = 0.82, Precision = 0.92, Recall = 0.75, and F1-score = 0.82, but had a limited inference speed of 5 FPS. YOLOv8 demonstrated real-time performance with 45 FPS but lower accuracy (mAP@0.50 = 0.62, Recall = 0.57), particularly struggling with subtle fracture detection. RetinaNet produced intermediate results, yielding mAP@0.50 = 0.67 and 10 FPS. The superior feature extraction of Faster R-CNN underscores the clinical benefit of two-stage approaches when diagnostic accuracy is critical.
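As a quick consistency check, the reported F1-score follows from the stated precision and recall, since F1 is their harmonic mean:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Faster R-CNN figures reported above: precision 0.92, recall 0.75
f1 = f1_score(0.92, 0.75)  # ≈ 0.826, consistent with the reported 0.82
```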
Conclusion: Faster R-CNN delivers the highest fracture detection accuracy, making it well-suited for clinical use, though its low FPS restricts real-time deployment. One-stage models excel in speed but fall short for complex diagnostic demands. Future research should explore lightweight optimisation of Faster R-CNN and larger datasets to enhance generalizability. Two-stage models remain valuable for high-precision medical image analysis.
Limitations: This study was limited to a single dataset (FracAtlas) with an imbalanced fractured-to-non-fractured ratio (1:4.6), which may affect model performance.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Systematic Assessment of the Medical Utility of Radiology Artificial Intelligence in Fracture Detection (SAMURAI-Fracture): A multi-centre prospective cluster randomised cross-over trial
James Vaz, Gerrards Cross / United Kingdom
Author Block: J. Vaz1, S. Ather1, M. J. Lundemann2, T. Bentabol Munoz1, S. A. Beer1, A. Espinosa1, K. Nash1, A. Novak1; 1Oxford/UK, 2København K/DK
Purpose: To evaluate whether implementation of an AI-assisted fracture detection tool reduces unnecessary healthcare contacts arising from misdiagnosed fractures in emergency departments (EDs) and minor injuries units (MIUs).
Methods or Background: A multicentre, prospective, cluster randomised cross-over trial is being conducted across four NHS Trusts, encompassing level 1 EDs and MIUs.

All patients aged ≥2 years undergoing X-ray for suspected fracture are eligible, with exclusions for skeletal surveys for non-accidental injury and skull, facial, dental, and cervical spine radiographs.

Approximately 45,000 patients are expected to be recruited during a six-month period (October 2025–March 2026).

An MHRA-approved AI tool is integrated into PACS at each site.

Randomisation determines whether sites commence with AI active or inactive, alternating monthly between “on” and “off” periods. During “on” periods, clinicians access AI-annotated images as adjunctive decision support; during “off” periods, standard practice applies.

The primary outcome is the proportion of unnecessary NHS contacts, defined as re-attendances or referrals resulting from false negatives and false positives.

Secondary outcomes include diagnostic accuracy compared with radiology reports, subgroup analyses by anatomical region and demographics, length of stay in ED, and patient/clinician experiences captured via questionnaires.
Results or Findings: Results will compare unnecessary NHS contacts, diagnostic accuracy, and service outcomes between AI “on” and “off” periods.
Conclusion: This study will provide the first large-scale prospective evidence on the clinical, service, and health-economic impact of AI fracture detection in real-world NHS settings.
Limitations: As a pragmatic cluster trial, differences in staffing, patient mix, or workflow between periods may influence results.
Funding for this study: Small Business Research Initiative (SBRI) Healthcare Urgent and Emergency Care Grant
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: UK Research Ethics Committee Approval Obtained (IRAS 357391)
6 min
Multi-center Post-Implementation Monitoring of an AI Fracture Detection algorithm on Trauma Patients: Workflow Redesign, Safety Signals and Governance
Jonas Vardal, Drammen / Norway
Author Block: R. Sivanandan1, J. Vardal2, L. Tveiten2, K. G. Brurberg2, B. A. Graff2; 1Asker/NO, 2Drammen/NO
Purpose: To evaluate the performance, workflow impact, and patient safety of a CE-marked AI application for fracture detection on X-rays of patients with a history of trauma, following deployment across two hospitals within a regional healthcare trust.
Methods or Background: A retrospective, multi-center post-implementation study was conducted over two months at Site-A (n=1284) and Site-B (n=1177). AI analyzed trauma-related X-rays, with radiologist reports serving as ground truth. Patient data included demographics, X-ray region, AI/radiologist findings, fracture types, additional observations, patient disposition, and follow-up imaging. Workflow was redesigned to allow patient discharge based on AI results obtained within approximately two minutes, while radiologist reports were issued later the same or the next day, enabling faster clinical decision-making with post-hoc expert validation.
Results or Findings: Fracture prevalence in the study population (ages 2–99 years) was 34–37% (≈450–480 per site). AI and radiologist results showed high overall concordance at both hospitals (86–89%). However, AI missed fractures in 2.4% of cases (23–35 per site), and in 0.9% (11–12 per site) patients were prematurely discharged based on AI results. Four patients with fractures were later recalled for treatment, while the remaining 0.8% (19 cases) received conservative treatment with a follow-up scan after the radiologist's report. Workflow redesign reduced discharge time substantially compared with the historical baseline, with consistent patterns across hospitals.
Conclusion: Multi-center monitoring confirmed consistent AI performance and notable workflow acceleration, while also identifying safety concerns such as missed fractures, underscoring the critical need for radiologist oversight. These findings emphasize the importance of structured post-deployment governance to balance operational efficiency with human oversight in AI-integrated patient care.
Limitations: Limitations include the restricted one-month observation period at each site and the absence of long-term or continuous monitoring of patient outcomes, which could be addressed in future work.
Funding for this study: Not funded
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Ref-ID 23/07690-3, from the data protection officer in Vestre Viken Hospital Trust (HT)
6 min
Comparative evaluation of commercial AI algorithms for fracture detection on musculoskeletal radiographs
James Vaz, Gerrards Cross / United Kingdom
Author Block: J. Vaz, K. Nash, A. Novak, C. Mihaiu, N. Salik, S. Ather; Oxford/UK
Purpose: To conduct a comparative evaluation of commercial AI tools designed to detect fractures on musculoskeletal radiographs, assessing diagnostic performance across vendors.
Methods or Background: Six commercial vendors were invited, of which three participated in this retrospective evaluation. A dataset of 500 anonymised radiographs was collected from Oxford University Hospitals and enriched to ~50% fracture prevalence to ensure anatomical and pathological diversity.

Each case's ground truth was labelled through independent review by two musculoskeletal radiologists (>10 years’ experience), with arbitration from a third where required. All images were processed in batch by the vendors, who had up to 72 hours to return results.
Results or Findings: All participating vendors returned results within the 72-hour period. One vendor processed all images within its intended use case, whilst the other two did not return results for 16–18 cases each.

Accuracy ranged from 80.4% (95% CI: 76.4-83.8) to 87.2% (83.9-89.8). Sensitivity ranged from 70.4% (63.5-76.5) to 89.7% (84.8-93.2) and specificity from 73.9% (68.5-78.7) to 87.7% (83.1-91.2).

McNemar's test revealed significant differences in accuracy, sensitivity, and specificity across tools (p <0.001), with some tools excelling in sensitivity and others in specificity.
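McNemar's test, used above, compares paired classifiers on the same cases and depends only on the discordant pairs. A stdlib-only sketch of the uncorrected chi-square form; the counts passed in are illustrative, not the study's data:

```python
import math

def mcnemar(b, c):
    """McNemar chi-square test (no continuity correction).

    b = cases tool A classified correctly and tool B incorrectly;
    c = cases tool B classified correctly and tool A incorrectly.
    Returns (statistic, two-sided p-value) on 1 degree of freedom.
    """
    stat = (b - c) ** 2 / (b + c)
    # chi-square(1 df) survival function via the complementary error function
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Illustrative discordant counts only (not the study's data):
stat, p = mcnemar(b=40, c=12)
```

The concordant cases (both tools right or both wrong) cancel out, which is what makes the test appropriate for comparing vendors on the same 500-image dataset.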
Conclusion: Multiple commercial AI tools demonstrated high overall accuracy in detecting fractures on radiographs. However, there was notable variation in diagnostic performance between vendors.

Sensitivity and specificity showed the greatest divergence: two tools prioritised one metric over the other, while the third offered a more balanced performance profile. These differences may influence clinical outcomes and should be carefully considered when selecting AI tools for deployment.
Limitations: Commercial vendors were required to adapt their algorithms to a binary output (‘fracture present’ vs. ‘fracture absent’), whereas in routine use they may also provide a third category (‘possible fracture’). This constraint may have influenced the reported sensitivity and specificity.
Funding for this study: Innovate UK
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study was conducted in accordance with the Declaration of Helsinki. Ethics approval was granted for this study under Simulation Training in the Emergency Department clinical trials identifier NCT05427838.
6 min
AI-Enhanced Fracture Detection in the ED: A Real-World Assessment of Diagnostic Accuracy and Imaging Utilisation
Molly Godson Treacy, Dublin / Ireland
Author Block: M. Godson Treacy, S. Doherty, O. O'Brien, F. Desmond, F. Husson, P. J. Macmahon; Dublin/IE
Purpose: Accurate and timely diagnosis of pelvic, hip, and spinal fractures in the Emergency Department (ED) is critical to reducing morbidity and mortality. Gleamer BoneView, an AI-based fracture detection tool, was implemented in our hospital in 2024. This study aims to evaluate the impact that the use of Gleamer BoneView had on fracture detection in a real-world ED setting. We hypothesised that AI assistance would improve detection rates without increasing downstream imaging.
Methods or Background: We retrospectively analysed adult ED patients who underwent pelvic/hip or spinal radiographs for suspected fracture from June–August 2024 (AI-assisted reporting) and during the corresponding period in June–August 2023 (radiologist-only reporting). Fracture detection was determined by the final radiologist report. Detection rates were compared using chi-squared tests (p<0.05). CT and MRI utilisation, along with AI diagnostic performance (sensitivity, specificity, PPV, NPV), were also assessed.
Results or Findings: Pelvic/hip fracture detection increased from 12.7% (49/386) in 2023 to 19.2% (70/365) in 2024 (p=0.02).
Spinal fracture detection increased from 14.7% (67/457) in 2023 to 21.3% (64/301) in 2024 (p=0.03).
Downstream CT use remained stable for both pelvic/hip (11.7% vs. 10.4%) and spinal radiographs (7.8% vs. 7.6%).
Downstream MRI use for indeterminate spinal radiographs decreased from 5.3% in 2023 to 3.3% in 2024.
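The year-on-year detection-rate comparisons can be reproduced with a 2×2 chi-squared test on the reported counts. A stdlib-only sketch using Pearson's statistic without continuity correction, so the p-value may differ slightly from the published one depending on the exact variant used:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test on the 2x2 table [[a, b], [c, d]],
    without Yates continuity correction. Returns (statistic, p) on 1 df."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # chi-square(1 df) survival function via the complementary error function
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Pelvic/hip radiographs: 49/386 fractures detected in 2023 vs 70/365 in 2024
stat, p = chi2_2x2(49, 386 - 49, 70, 365 - 70)   # p below the 0.05 threshold
```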

AI diagnostic performance was as follows:
Spine radiographs: Sensitivity 90%, Specificity 93%, PPV 79%, NPV 97%.
Pelvis/hip radiographs: Sensitivity 85%, Specificity 92%, PPV 71%, NPV 96%.
Conclusion: AI-assisted radiograph interpretation significantly improved fracture detection rates in the ED, without increased CT usage and with reduced MRI demand. These results support the integration of AI tools into acute imaging pathways to enhance diagnostic accuracy and streamline patient management.
Limitations: Short timeframe and small patient cohort; single-centre study.
Funding for this study: N/A
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Hospital ethics committee.