Algorithmic Fairness in Radiology: A Practical Deep Dive into AI Challenges
Author Block: B. Mohajer1, A. Zain2, H. Zhang2, Z. Hu3, R. Ball4, J. Gichoya5, A. E. Flanders1, M. Ghassemi2, E. Colak3; 1Philadelphia, PA/US, 2Boston, MA/US, 3Toronto, ON/CA, 4Bar Harbor, ME/US, 5Atlanta, GA/US
Purpose: Machine learning (ML) models have demonstrated expert-level performance in radiology; however, concerns about fairness persist, as unfair models risk reinforcing healthcare inequities. AI competitions have advanced the field by assembling the most diverse publicly available datasets and hosting open challenges, yet even in these controlled settings concerns remain about the fairness of ML models. This study assessed fairness in top-performing ML models from the Radiological Society of North America (RSNA) Cervical Spine Fracture and Abdominal Trauma Detection AI Challenges, focusing on performance differences across demographic subgroups.
Methods or Background: Predictions from the nine top-performing models were evaluated on private test sets stratified by age group, sex, and geographical region. Performance metrics, including false positive rate (FPR), false negative rate (FNR), area under the receiver operating characteristic curve (AUC), and expected calibration error (ECE), were compared across subgroups.
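The subgroup-stratified evaluation described above can be sketched in a few lines of code. The sketch below is illustrative only (it is not the challenge evaluation pipeline); the 0.5 decision threshold, the 10-bin ECE, and all function names are assumptions for the example, and the AUC is computed via the Mann-Whitney U formulation.

```python
# Illustrative sketch: per-subgroup fairness metrics (FPR, FNR, AUC, ECE)
# for binary labels y_true in {0, 1} and scores y_score in [0, 1].
# Threshold and bin count are illustrative choices, not from the study.

def fpr_fnr(y_true, y_score, threshold=0.5):
    """FPR = FP/(FP+TN), FNR = FN/(FN+TP) at a fixed threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr

def auc(y_true, y_score):
    """AUC as P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ece(y_true, y_score, n_bins=10):
    """ECE: bin-weighted gap between mean score and observed positive rate."""
    total, err = len(y_true), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, s in enumerate(y_score)
               if lo <= s < hi or (b == n_bins - 1 and s == 1.0)]
        if not idx:
            continue
        conf = sum(y_score[i] for i in idx) / len(idx)
        acc = sum(y_true[i] for i in idx) / len(idx)
        err += len(idx) / total * abs(acc - conf)
    return err

def metrics_by_subgroup(y_true, y_score, groups):
    """Stratify by a subgroup label (e.g. age band, sex, region)."""
    out = {}
    for g in set(groups):
        yt = [t for t, gg in zip(y_true, groups) if gg == g]
        ys = [s for s, gg in zip(y_score, groups) if gg == g]
        fpr, fnr = fpr_fnr(yt, ys)
        out[g] = {"FPR": fpr, "FNR": fnr, "AUC": auc(yt, ys), "ECE": ece(yt, ys)}
    return out
```

Disparities would then be read off as differences in these per-subgroup values (e.g. FPR for patients aged 61 and over versus younger patients), with significance tested separately.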
Results or Findings: The study included 788 participants from the Cervical Spine challenge (64% male, mean age 54.8 years) and 709 participants from the Abdominal Trauma challenge (69% male, mean age 48.7 years). No significant AUC or FNR differences were observed across subgroups or between sexes. However, age- and region-specific FPR disparities emerged. For cervical spine fractures, older adults (≥61 years) had higher FPRs (9.7% vs. 2.6%, p<0.05). In abdominal trauma detection, older adults also showed elevated FPRs (11.6%, p=0.003). Geographic variation was notable: patients from Asia had higher FPRs (28.0%), while patients from Oceania had lower rates (5.6%, p<0.05).
Conclusion: Despite the models being trained on the most diverse datasets available, subgroup-specific differences in FPR, particularly across age groups, persisted. These findings highlight that even diverse training data may not entirely eliminate disparities. Continued efforts to improve demographic representation and to integrate fairness-aware approaches into ML development are essential.
Limitations: Limited subgroup sample sizes, especially from Africa and South America, may affect the robustness of the fairness estimates and limit generalizability.
Funding for this study: This study received no direct funding. In the preparation of the datasets and AI challenges, funding was received from the Radiological Society of North America (RSNA).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Approved by