Benchmarking of Artificial Intelligence and Radiologists for Indeterminate Lung Nodule Malignancy Risk Estimation on Screening CT: Results of the LUNA25 Challenge
Author Block: D. Peeters1, B. Obreja1, N. Antonissen1, Z. Saghir2, U. Pastorino3, G. De Bock4, R. Vliegenthart4, M. Prokop1, C. Jacobs1; 1Nijmegen/NL, 2Hellerup/DK, 3Milan/IT, 4Groningen/NL
Purpose: Accurate risk classification of indeterminate (5-15 mm) lung nodules can reduce unnecessary follow-up in lung cancer screening. AI may assist in risk classification, but benchmarking studies are limited. Here, we present the results of the LUNA25 challenge, a public competition that evaluates AI and radiologist performance for malignancy risk estimation of indeterminate nodules at screening CT.
Methods or Background: LUNA25 consists of an AI study and a reader study. For AI development, participants had access to a public dataset of 4069 CT scans from the National Lung Screening Trial (NLST), with 555 malignant and 5608 benign nodules. AI evaluation was performed on an external test set with 156 malignant and 312 benign indeterminate solid and part-solid nodules from baseline scans of the Danish (DLCST), Dutch-Belgian (NELSON), and Italian (MILD) lung cancer screening trials. For the reader study, radiologists assessed 300 nodules from the test set, assigning each a malignancy risk score (0–100) and a management recommendation (low, intermediate, or high-risk). Performance was compared using area under the ROC curve (AUC), sensitivity, and specificity.
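As an illustration of the three metrics named above, the following is a minimal sketch and not the challenge's evaluation code. The function names and the toy scores are hypothetical; the AUC is computed from its pairwise (Mann-Whitney) definition, with ties counted as one half.

```python
def auc(malignant_scores, benign_scores):
    """AUC as the probability that a randomly chosen malignant nodule
    receives a higher score than a randomly chosen benign nodule."""
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5  # ties count as half a win
    return wins / (len(malignant_scores) * len(benign_scores))


def sensitivity_specificity(malignant_scores, benign_scores, threshold):
    """Call a nodule positive when its score is >= threshold."""
    true_pos = sum(s >= threshold for s in malignant_scores)
    true_neg = sum(s < threshold for s in benign_scores)
    return true_pos / len(malignant_scores), true_neg / len(benign_scores)


# Toy risk scores on the 0-100 scale (illustrative values, not study data).
toy_malignant = [80, 60, 40]
toy_benign = [50, 30, 20, 10]

toy_auc = auc(toy_malignant, toy_benign)
toy_sens, toy_spec = sensitivity_specificity(toy_malignant, toy_benign, 50)
```

The pairwise form is quadratic in the number of nodules, which is acceptable at the scale of a few hundred cases; a rank-based implementation would give the same value.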
Results or Findings: On the subset of 300 nodules, the top-performing AI system achieved a significantly higher AUC (0.78; 95% CI: 0.73-0.84) than the average of the 75 readers (0.69; 95% CI: 0.64-0.74; p<0.001). At the ≥ indeterminate risk threshold, the AI correctly classified 12% more malignant cases at matched specificity and produced 20% fewer false positives at matched sensitivity.
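A comparison "at matched specificity" can be sketched as follows: fix the specificity observed for the readers, find the AI score threshold that reaches at least that specificity, and read off the AI's sensitivity there. This is a hedged illustration only; the function name, the matching rule (lowest threshold that meets the target), and the toy scores are assumptions, not the procedure used in the study.

```python
def sensitivity_at_matched_specificity(ai_malignant, ai_benign, target_specificity):
    """Return (threshold, sensitivity, specificity) for the lowest AI
    threshold whose specificity is at least target_specificity.
    A nodule is called positive when its score is >= threshold."""
    candidates = sorted(set(ai_malignant) | set(ai_benign))
    # A threshold above every score guarantees specificity 1.0,
    # so the loop below always returns.
    candidates.append(candidates[-1] + 1)
    for threshold in candidates:
        specificity = sum(s < threshold for s in ai_benign) / len(ai_benign)
        if specificity >= target_specificity:
            sensitivity = sum(s >= threshold for s in ai_malignant) / len(ai_malignant)
            return threshold, sensitivity, specificity


# Toy AI scores (illustrative values, not study data) and a
# hypothetical reader specificity of 0.8 to match against.
toy_ai_malignant = [90, 70, 55, 35]
toy_ai_benign = [60, 40, 30, 20, 10]

matched = sensitivity_at_matched_specificity(toy_ai_malignant, toy_ai_benign, 0.8)
```

Scanning thresholds in ascending order means the first one that meets the target keeps sensitivity as high as possible. The mirror-image comparison at matched sensitivity follows the same pattern with the roles of the two metrics swapped.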
Conclusion: The top-performing AI system performed significantly better than the average radiologist in estimating malignancy risk for indeterminate lung nodules detected on screening CT, highlighting its potential use as a decision-support tool.
Limitations: LUNA25 only benchmarks AI’s stand-alone performance and does not address workflow integration or radiologist-AI interaction.
Funding for this study: This work was supported by a public-private research project with funding from the Dutch Research Council (NWO), the Dutch Ministry of Economic Affairs, and MeVis Medical Solutions (Bremen, Germany); by a public-private project with funding from the Dutch Cancer Society (KWF Kankerbestrijding, project number 9037) and Siemens Healthineers; and by a project with funding from the Dutch Cancer Society (KWF Kankerbestrijding, project number 14113).
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: For the training set, the NLST received institutional review board approval at all 33 centers participating in the trial, and all trial participants provided informed consent. Access to this dataset was granted through the National Cancer Institute's Cancer Data Access System (CDAS) under project numbers NLST-74, NLST-111, NLST-164, and NLST-267.
Ethical approvals for the testing set were obtained from the Ethics Committee of Copenhagen County (DLCST), the institutional review board of Fondazione IRCCS Istituto Nazionale Tumori di Milano (MILD), and the Dutch Minister of Health with support from the Dutch Health Council (NELSON), along with authorization from the Ethical Boards of participating centres.