Research Presentation Session: Imaging Informatics and Artificial Intelligence

RPS 2105 - Artificial intelligence in chest imaging

March 1, 16:00 - 17:30 CET

7 min
Standardized platform to evaluate, compare, and analyze AI-based software for detection and classification of lung nodules for the purpose of CT lung cancer screening implementation
Xiaotong Ouyang, Groningen / Netherlands
Author Block: X. Ouyang1, K. Togka2, D. Han2, H. L. Lancaster2, I. Schuldink2, A. N. Walstra2, C. Van Der Aalst1, H. J. De Koning1, M. Oudkerk2; 1Rotterdam/NL, 2Groningen/NL
Purpose: Low-dose CT detects lung nodules, and consequently lung cancer (LC), at an early stage, which has been proven to reduce LC mortality. To aid radiologists, commercially available AI-based software has been developed to analyze lung nodules. Self-reported performance metrics appear promising; however, no independent, standardized platform for external validation exists. We aimed to develop a standardized, independent, trustworthy platform to assess and compare the performance of commercially available AI lung nodule analysis software.
Methods or Background: We developed a platform using a sequestered dataset of 560 scans from the EU-funded 4-IN-THE-LUNG-RUN (4ITLR) lung cancer screening implementation trial. The platform is built on a systematic Structured Query Language (SQL) database architecture. AI software output in different data formats was reformatted and stored as standardized SQL records, eliminating manual errors and allowing AI results to be compared against the final consensus of an expert radiologist panel using uniform data-analysis algorithms. Performance is evaluated on two levels: at the nodule level, the detection/classification of the reference nodule is compared per participant; at the participant level, the comparison is based on the largest detected solid nodule.
Results or Findings: Performance of the AI software at the nodule level was reported as frequencies of agreement and discrepancy with the 4ITLR consensus result on the reference nodule. At the participant level, Cohen's kappa coefficient was used to measure agreement with the reference.
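As an illustrative sketch of the participant-level agreement statistic, Cohen's kappa can be computed in a few lines of plain Python (the category labels below are hypothetical, not the platform's actual output schema):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from the marginal frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical per-participant calls: AI software vs expert panel consensus
ai    = ["pos", "pos", "neg", "neg", "pos", "neg"]
panel = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(ai, panel), 3))  # → 0.667
```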
Conclusion: The standardized platform developed provides an independent assessment of AI software performance. Clinical users benefit from reliable comparison of outcomes for lung nodule analysis and transparency of commercial AI in radiology.
Limitations: No limitations have been identified yet.
Funding for this study: The 4-IN-THE-LUNG-RUN trial is funded by the European Union (grant number:848294)
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Not applicable
7 min
Benchmarking of Artificial Intelligence and Radiologists for Lung Cancer Screening in CT: The LUNA25 Challenge
Dre Peeters, Nijmegen / Netherlands
Author Block: D. Peeters1, B. Obreja1, N. Antonissen1, R. Dinnessen1, Z. Saghir2, E. Scholten1, R. Vliegenthart3, M. Prokop1, C. Jacobs1; 1Nijmegen/NL, 2Hellerup/DK, 3Groningen/NL
Purpose: The imminent implementation of lung cancer screening and the growing workload for radiologists demonstrate the need for safe and validated AI algorithms. At present, it is challenging to adequately validate and benchmark the increasing number of AI algorithms being developed. In this study, we present the LUNA25 challenge, a public competition aiming to evaluate the diagnostic performance of AI algorithms and radiologists in lung nodule malignancy risk estimation at screening CT.
Methods or Background: The LUNA25 dataset will include 5051 screening CT scans from the National Lung Cancer Screening Trial (NLST), with 624 malignant and 7414 benign nodules. Participating teams can access this dataset to develop AI algorithms. For algorithm validation, a separate set of 65 malignant and 818 benign nodules from the Danish Lung Cancer Screening Trial (DLCST) will serve as a hidden test set. Additionally, a subset from DLCST with indeterminate nodules measuring 5-15mm in diameter will be assessed by a panel of radiologists with varying experience levels to benchmark radiologists’ performance against AI algorithms. Performance will be measured using area under the ROC curve (AUC) and at different operating points in terms of sensitivity and specificity.
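For reference, the challenge's headline metrics can be sketched in plain Python: the ROC AUC in its Mann-Whitney form, plus sensitivity/specificity at an operating point (the risk scores below are hypothetical, not LUNA25 data):

```python
def roc_auc(malignant_scores, benign_scores):
    """AUC as the probability that a randomly chosen malignant nodule
    scores higher than a randomly chosen benign one (ties count 0.5)."""
    pairs = len(malignant_scores) * len(benign_scores)
    wins = sum((m > b) + 0.5 * (m == b)
               for m in malignant_scores for b in benign_scores)
    return wins / pairs

def sens_spec_at(threshold, malignant_scores, benign_scores):
    """Operating point: sensitivity and specificity at a score threshold."""
    sens = sum(m >= threshold for m in malignant_scores) / len(malignant_scores)
    spec = sum(b < threshold for b in benign_scores) / len(benign_scores)
    return sens, spec

# Toy malignancy-risk scores for three malignant and two benign nodules
print(roc_auc([0.9, 0.8, 0.4], [0.3, 0.5]))
```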
Results or Findings: With the NLST and DLCST cohorts collected, the challenge is ready to be introduced to the ECR audience. Preliminary results with an in-house developed AI algorithm demonstrated a mean AUC of 0.91 [0.87, 0.95] on DLCST.
Conclusion: The LUNA25 challenge expects to establish a worldwide benchmark for AI algorithms in estimating lung nodule malignancy risk at screening CTs and offer insights into how AI compares to radiologists across different experience levels and operating points.
Limitations: LUNA25 only benchmarks AI’s stand-alone performance, and does not address workflow integration or radiologist-AI interaction, which are important for clinical adoption.
Funding for this study: Funding was provided by the Dutch Cancer Society
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The institutional review board waived the need for informed consent because of the retrospective design and data pseudonymization.
7 min
Systematic prioritisation of AI-detected chest X-ray abnormalities for optimised lung cancer detection: a multicentre study
Rhidian Bramley, Manchester / United Kingdom
Author Block: R. Bramley1, A. Sharman1, R. Duerden2, S. Lyon1, M. Ryan3, E. Weber4, L. Brown1, M. Evison1; 1Manchester/UK, 2Stockport/UK, 3Sydney/AU, 4Linköping/SE
Purpose: This multicentre study aimed to establish a reproducible and data-driven method for selecting AI-detected chest X-ray (CXR) abnormalities to be prioritised for urgent reporting, supporting faster lung cancer diagnosis. By analysing cancer prevalence and clinical significance across two distinct cohorts from seven acute trusts, the study sought to maximise lung cancer detection while maintaining a high negative predictive value (NPV).
Methods or Background: The study involved two cohorts: a retrospective cohort of 1,282 CXRs from primary care with detectable lung cancer (Cohort 1) and a prospective cohort of 13,802 consecutive primary care adult CXRs (Cohort 2), with AI deployed in shadow mode. The Annalise-AI platform identified 124 distinct findings. An interactive tool was developed to assess prioritisation strategies based on the cancer prevalence ratio of each AI finding, individually and in combination, together with clinical judgement.
Results or Findings: The final prioritisation strategy flagged 41 AI findings, which covered 95.9% of cancers in Cohort 1 and 21.6% of CXRs in Cohort 2 (sensitivity 95.87%, specificity 79.11%, PPV 4.43%, NPV 99.95%). A further 15 AI findings, not associated with cancer, were prioritised on clinical judgement as potentially requiring prompt intervention.
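The reported operating-point metrics all derive from a 2×2 confusion matrix; a minimal sketch (the counts below are invented for illustration and are not the study's raw numbers):

```python
def triage_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from 2x2 confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # flagged cancers / all cancers
        "specificity": tn / (tn + fp),  # unflagged non-cancers / all non-cancers
        "ppv": tp / (tp + fp),          # cancers among flagged CXRs
        "npv": tn / (tn + fn),          # non-cancers among unflagged CXRs
    }

# Hypothetical counts for a flagging rule in a low-prevalence population
m = triage_metrics(tp=116, fp=2500, fn=5, tn=9500)
```

In a screening-style population the low PPV alongside a very high NPV, as in the study, is the expected signature of a sensitive triage rule applied at low disease prevalence.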
Conclusion: This study demonstrates a reproducible and data-driven method for prioritising AI-detected CXR abnormalities, balancing the need for high sensitivity and NPV while reducing unnecessary prioritisation of low-risk cases. The shadow mode approach ensured clinical safety before deployment, and the interactive tool provided a systematic means to assess prioritisation strategies, offering a practical alternative to traditional judgement-based methods and supporting more efficient lung cancer diagnosis.
Limitations: The tool is designed to support assessment of AI performance in shadow mode in the referral population. Performance metrics should be validated before deployment in other populations.
Funding for this study: Funding was provided by the NHS England National AI Diagnostics fund (AIDF).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: The study was performed in shadow mode and did not impact on patient care.
7 min
Beyond Nodules: A Deep Learning Approach for Comprehensive Lung Tumour Segmentation on CT
Liliana Petrychenko, Amsterdam / Netherlands
Author Block: L. Petrychenko, V. Pugliese, R. G. H. Beets-Tan, L. Topff, K. Groot Lipman; Amsterdam/NL
Purpose: Several commercially available AI applications for lung nodule analysis on chest CT are limited to the detection and segmentation of nodules up to 30 mm. There is clinical potential for AI-assisted volumetric analysis and treatment monitoring of lung tumours of any size, including masses. We aim to develop a Deep Learning model to detect and segment lung lesions, including primary cancers of all T-stages.
Methods or Background: In this retrospective study, we collected 1001 chest CT scans from 504 patients (mean age 66.4±10.3 years; 52% female) with histopathologically confirmed primary lung cancer, treated at the Netherlands Cancer Institute. Both the baseline and first follow-up scans after treatment were included. Patients were randomly assigned to 90% training and 10% testing sets. Two radiologists (4-7 years of experience) performed manual segmentation of all lung nodules ≥3 mm and all masses. The deep learning model used a Residual Encoder nnU-Net (ResEnc XL) architecture, trained with the SGD optimizer, an initial learning rate of 10⁻², and a batch size of 2.
Results or Findings: The dataset represented all T-stages (Tis/T1/T2/T3/T4: 1.6/34/21/16/28%) and major histopathological types, with lesion sizes ranging from 3 to 135 mm. The DL model achieved a median Dice Similarity Coefficient (DSC) of 90.0% across all lung lesions, with a median of 1 false positive detection per scan. For primary lung tumours, detection sensitivity was 77.2% and the median DSC was 90.4%.
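The headline DSC can be illustrated with a minimal Dice implementation over sets of voxel coordinates (the coordinates below are toy values, not study data):

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks, each given
    as a collection of voxel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(a & b) / (len(a) + len(b))

# Toy 2D example: prediction shifted one row from the reference
reference  = {(0, 0), (0, 1), (1, 0), (1, 1)}
prediction = {(1, 0), (1, 1), (2, 0), (2, 1)}
print(dice(reference, prediction))  # → 0.5
```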
Conclusion: The DL model demonstrated very good segmentation performance for primary lung tumours of all sizes, including masses. The model has the potential to assist physicians in treatment monitoring and planning, though further improvements in detection sensitivity could enhance its clinical utility.
Limitations: The model requires both external and clinical validation.
Funding for this study: No additional funding was received; the study was conducted entirely at the Netherlands Cancer Institute.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study received Institutional Review Board (IRB) approval.
7 min
Foundation Model-based Unsupervised CT Kernel Conversion for Standardizing Emphysema Quantification
Doohyun Park, Seoul / Korea, Republic of
Author Block: D. Park, J-H. Kang, J. Jeong; Seoul, Republic of Korea/KR
Purpose: Emphysema quantification is crucial for evaluation and management of chronic obstructive pulmonary disease (COPD). Typically, emphysema is identified in computed tomography (CT) images reconstructed with smooth kernels. However, CT reconstruction kernels vary, and raw data are often deleted after reconstruction, making it hard to adjust the kernel retrospectively. Therefore, this study aims to develop and validate a method for kernel conversion to standardize emphysema quantification using a foundational deep learning model.
Methods or Background: Paired CT images from nine cases reconstructed with different kernels were used. Automated lung segmentation was performed using TotalSegmentator, a foundational deep learning model. An unsupervised kernel conversion method was then applied to transform the images to a pre-defined kernel. The kernel conversion was evaluated by comparing the emphysema score (ES), defined as the ratio of regions with HU below -950 within the lung area, before and after the conversion.
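The emphysema score as defined here, the fraction of lung voxels below -950 HU, can be sketched directly (the voxel values below are toy inputs, not CT data):

```python
def emphysema_score(hu_values, lung_mask, threshold=-950):
    """ES (%): fraction of lung voxels with attenuation below `threshold` HU.
    `hu_values` and `lung_mask` are flattened, voxel-aligned sequences."""
    lung_hu = [hu for hu, in_lung in zip(hu_values, lung_mask) if in_lung]
    low = sum(hu < threshold for hu in lung_hu)
    return 100.0 * low / len(lung_hu)

# Toy voxels: five inside the lung mask, one (soft tissue) outside it
hu   = [-980, -960, -900, -700, -1000, 35]
lung = [1, 1, 1, 1, 1, 0]
print(emphysema_score(hu, lung))  # → 60.0
```

Because sharper kernels add high-frequency noise that pushes voxels below the -950 HU cut-off, the same lung yields a higher ES on a sharp kernel, which is the discrepancy the kernel conversion is meant to remove.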
Results or Findings: Before kernel conversion, the mean ES difference between images reconstructed with smoother kernels (e.g., B30f and STANDARD) and those with sharper kernels (e.g., B60f and LUNG) was 11.00±6.85%. After conversion to the target smooth kernel, the mean ES difference was reduced to 2.30±2.65%. Although the sample size was small, this reduction was statistically significant on a paired t-test (p=0.011).
Conclusion: The foundational model enables the conversion of CT images reconstructed with different kernels to a target smooth kernel, allowing for standardized emphysema quantification without the need for additional datasets for model development. This result suggests that the approach can be easily used by anyone with the appropriate software.
Limitations: For more rigorous validation, it is necessary not only to compare the ES difference before and after kernel conversion but also to evaluate against ground-truth emphysema masks.
Funding for this study: Not applicable.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: We used a dataset from the Korea Testing Laboratory (KTL) challenge.
7 min
Scientific Evidence of AI in Lung Nodule Evaluation on CT-examinations: A Systematic Review
Jasika Paramasamy, Breda / Netherlands
Author Block: J. Paramasamy, J-W. Groen, A. Leliveld, B. Willems, J. Aerts, A. Odink, J. J. Visser; Rotterdam/NL
Purpose: The purpose of this study was to systematically review the scientific evidence demonstrating the efficacy of CE-marked and/or FDA-cleared AI-applications for pulmonary nodule evaluation on CT examinations.
Methods or Background: Following the PRISMA guidelines, the Medline, Embase, Web of Science, Cochrane, and Google Scholar databases were searched (Jan 1, 2012–Sep 30, 2024) for studies on AI-based evaluation of pulmonary nodules on CT scans. Included articles were classified according to a hierarchical model of AI efficacy, the Radiology AI Deployment and Assessment Rubric (RADAR) framework. Additionally, the evolution of evidence over time was examined.
Results or Findings: A total of 98 articles encompassing AI applications for lung nodule evaluation from 16 vendors were included, with approximately 90% of clinical questions addressed through cross-sectional studies. These publications primarily focused on automatic lung nodule detection, accounting for 61.8% of the studies. All included articles were classified by their highest level of efficacy on RADAR, with the majority (41/98) at level 2 (diagnostic accuracy). Standalone nodule detection sensitivities in these studies ranged from 50% to 99%. No studies were identified at efficacy levels 5 (patient outcomes), 6 (cost-effectiveness), or 7 (local impact). The number of articles at levels 3 (diagnostic thinking) and 4 (therapeutic impact) was one between 2012 and 2016 and increased to 40 between 2020 and 2024.
Conclusion: Current scientific evidence for AI-applications in lung nodule evaluation primarily emphasizes diagnostic accuracy. However, there is a noticeable shift in research towards exploring the potential clinical impact of this technology.
Limitations: No meta-analysis was conducted due to significant heterogeneity in methods and reporting. Moreover, vendor involvement in most studies could potentially influence outcomes and introduce bias. Furthermore, as AI for lung nodule evaluation rapidly evolves, the included articles since 2012 may reflect variations in AI-application performance over time.
Funding for this study: Unrestricted institutional grant
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Systematic reviews of existing published literature
7 min
On the effect of lesion number on the FROC performance in AI-based lung nodule detection
Thibault Escobar, Montpellier / France
Author Block: T. Escobar, E. Oubel; Montpellier/FR
Purpose: In the context of lung nodule detection, we aimed to determine whether the number of lesions per patient acts as a confounding variable in performance evaluation, potentially affecting metrics such as sensitivity and FROC, and whether this factor should therefore be rigorously controlled during testing.
Methods or Background: Two experiments were conducted using the LIDC-IDRI dataset. In both, a trained model was evaluated on sub-groups of patients sorted by decreasing lesion number. The first experiment formed cumulative sub-groups by adding or discarding 10 patients at a time, creating groups of varying size. To rule out any effect or spurious correlation related to patient number, the second experiment employed a sliding window of 100 patients with a step of 10. For each sub-group, FROC curves were computed from 5-fold cross-validation predictions over the whole dataset. Additionally, false positives per scan (FP/s), false negatives per scan (FN/s), true positives per scan (TP/s), and sensitivity (Se) were evaluated to identify which parts of the FROC were affected.
Results or Findings: A clear inverse relationship between lesion number and FROC scores was observed. Pearson and Spearman correlation coefficients were significant and equal to -0.9 in both experiments. As lesion number increased, TP/s, FP/s, and FN/s increased while Se decreased (i.e., the increase in TP/s did not compensate for the increases in FP/s and FN/s).
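The reported correlation can be reproduced in spirit with a plain-Python Pearson coefficient (the per-sub-group values below are hypothetical, not the study's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sub-group summaries: mean lesions per patient vs FROC score
lesions = [1.2, 1.8, 2.5, 3.1, 4.0]
froc    = [0.92, 0.88, 0.83, 0.79, 0.71]
r = pearson_r(lesions, froc)  # strongly negative, as in the study
```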
Conclusion: The number of lesions per patient inversely affects the FROC of lung nodule detection models. It should therefore be controlled and documented during model evaluation to ensure accurate performance assessments and to clarify the conditions under which they hold. Further studies are required to rigorously examine these effects and validate the hypotheses.
Limitations: Limitations include the absence of a specific investigation into the sources of the FP/s and FN/s increases with lesion number.
Funding for this study: This study was fully funded by the company Intrasense SA as part of its research and development activity.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Not applicable
7 min
Optimizing Healthcare Sustainability through AI-Assisted Lung Cancer Detection at the time of initial CXR
Jack Packer, London / United Kingdom
Author Block: J. Packer, M. Storey, A. Chung, S. J. Rickaby, G. Dean, S. C. Shelmerdine, C. Malamateniou; London/UK
Purpose: Lung cancer is the leading cause of cancer mortality in the UK. Early diagnosis is essential but hindered by workforce shortages and limited CT access. This study evaluates whether the 'Artificial Intelligence triage to same-day CT' (AI-CT) pathway, using the Annalise CXR v2.3 model, can enhance healthcare sustainability by reducing patient visits, travel emissions, and administrative workload while improving CT access for suspected cancer.
Methods or Background: Sustainability indices for 26,660 patients (January 2022–October 2023) were assessed across five NHS centres in London. Key metrics included time from chest radiograph to CT report (Time to CT), AI accuracy, and cancer suspicion on CXR and CT, pre- and post-AI-CT implementation. Time to CT was measured using survival analysis, and diagnostic performance (AUC-ROC, F1 scores) was calculated based on CT-confirmed cancer. Time saved for patients and admin teams was estimated, and carbon reduction was calculated using the Carbon Trust online calculator.
Results or Findings: From 26,660 chest radiographs and 573 CT scans, 75 of 10,833 patients received same-day CT post-AI, compared to 13 of 8,434 pre-AI, eliminating 150 appointments. Each appointment saved ~1.5 hours per patient, totalling 225 hours, with ~37.5 additional hours saved for admin teams. With travel emissions estimated at 1 kgCO2e per patient, this potentially resulted in a 150 kgCO2e reduction. The AI model showed high sensitivity (91%) but low specificity (22%, F1 score 0.56). There was a significant increase in CT within 1 and 3 days post-suspicious CXR (HR 1.93, 1.34; p < 0.001).
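The stated savings follow from simple per-appointment arithmetic; a minimal sketch (the 0.25 h admin figure is inferred from the reported ~37.5 h total and is an assumption):

```python
# Per-appointment estimates from the abstract (admin time is inferred)
appointments_avoided = 150
hours_per_visit = 1.5        # estimated patient time per avoided visit
admin_hours_per_appt = 0.25  # assumption: 37.5 h total / 150 appointments
kg_co2e_per_trip = 1.0       # travel-emission estimate per patient trip

patient_hours_saved = appointments_avoided * hours_per_visit      # 225.0
admin_hours_saved = appointments_avoided * admin_hours_per_appt   # 37.5
co2_saved_kg = appointments_avoided * kg_co2e_per_trip            # 150.0
```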
Conclusion: The AI-CT pathway improved same-day CT access and reduced patient visits and emissions. However, the model's low specificity suggests a need for supervised triage to optimize performance.
Limitations: The AI's low specificity and its dependence on co-located facilities limit generalisability.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Local trust clinical audit and QI registration forms approved.
7 min
An artificial intelligence software for the detection of benign and non-typically benign pulmonary nodules on chest CT scans
Souhail Bennani, Paris / France
Author Block: S. Bennani1, N-E. Regnard2, M. Durteste1, V. Marty1, R. Quilliet1, A. Pourchot1, L. Clovis1, J. Ventre1, G. Chassagnon1; 1Paris/FR, 2Lieusaint/FR
Purpose: Detecting lung nodules on chest computed tomography (CT) is an important task that extends beyond the realm of lung cancer screening. This study aimed to compare the performance of radiologists to an AI software in identifying both non-typically benign and benign nodules on CT scans.
Methods or Background: We retrospectively collected thin-section chest CT scans from private practices across France, focusing on patients aged 15 or older. The dataset included patients with non-typically benign nodules (solid and sub-solid), benign nodules (granulomas and intrapulmonary lymph nodes), or no nodules. An expert thoracic radiologist defined the ground truth using past and follow-up scans as well as radiologist reports. We compared the performance of four radiologists, who had access to limited clinical information, with that of an AI software solution, LungCT (Gleamer, Paris, France). We conducted patient-wise ROC and lesion-wise FROC analyses.
Results or Findings: The final dataset included 250 chest CT scans (age = 66 ± 23 y, 117 women, 133 men). Among these, 128 scans contained at least one non-typically benign nodule, 40 displayed only benign nodules and 82 were nodule-free. The analysis focused on nodules with a diameter >6mm. The patient-wise AUC of the AI was 0.97 [0.93,1.00] and that of radiologists was 0.88 [0.83,0.92]. On a lesion-wise basis, the AI achieved a sensitivity of 79% [75%,83%] for 0.30 false positive (FP) per scan. Radiologists exhibited an average sensitivity of 72% [67%,76%] with a mean FP rate of 0.34 [0.25,0.45].
Conclusion: The AI solution demonstrated robust patient-wise performance and comparable lesion-wise detection of non-typically benign and benign nodules on CT scans to radiologists.
Limitations: The study’s retrospective design and limited sample size could affect the generalisability of results. Future research should evaluate the performance of AI-assisted radiologists.
Funding for this study: Gleamer (Paris, France) funded this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Not applicable
7 min
Improving the generalisation of radiographic AI using automated data curation to mitigate shortcut learning
Ian Andrew Selby, Cambridge / United Kingdom
Author Block: I. A. Selby1, E. González Solares1, A. Breger2, M. Roberts1, J. Babar1, F. J. Gilbert1, N. Walton1, C-B. Schönlieb1, J. R. Weir-Mccall3; 1Cambridge/UK, 2Vienna/AT, 3London/UK
Purpose: To investigate whether automated data curation pipelines for chest radiographs can improve deep-learning model performance on unseen data.
Methods or Background: Two public datasets, MIDRC-1A and MIDRC-R1, were used to develop diagnostic COVID-19 models using four architectures (DenseNet121/ResNet152V2/VGG16/EfficientNetB3). Each was trained four times using a different data curation workflow: WF1. Raw pixel data with partitioning stratified on dataset and COVID-19 status; WF2. DICOM-cleaned data with look-up tables applied, lateral projections and non-chest radiographs excluded, classes balanced on Manufacturer and Projection tags, and partitioning additionally stratified on the same metadata; WF3. Cases excluded using an open-source data-cleaning pipeline (AutoQC, https://gitlab.developers.cam.ac.uk/maths/cia/covid-19-projects/autoqc). Partitioning was stratified on projection and the presence of a pacemaker using AutoQC annotations; and WF4. The previous two workflows combined. COVID-19 diagnosis was inferred from laboratory tests, and model performance was assessed using five other public datasets. Generalisation from internal-to-external data was quantified using ΔAUCs.
Results or Findings: 43,176 radiographs were included in WF1, of which 33.2% (14,328) were COVID-19-positive. The development sets of the other workflows were up to 60% smaller. Similarly, the external test sets ranged from 24,563 to 38,417 patients, depending on the workflow. The WF1 models showed the largest fall in generalisation (mean ΔAUC = -0.15 [95%CI:-0.17,-0.14]), while models trained with AutoQC (WF3-4) demonstrated the most consistent performance, with mean ΔAUCs of -0.04 [95%CI:-0.06,-0.02] and -0.02 [95%CI:-0.04,0.00] for WF3 and WF4, respectively (p<0.05). The WF2 models had a mean ΔAUC of -0.07 [95%CI:-0.09,-0.05].
Conclusion: Automated data curation can improve the generalisation of deep learning models for chest radiographs, facilitating more consistent performance on data from new locations and equipment.
Limitations: Future work should evaluate the tool in multiclassification tasks and non-COVID-19 datasets. In addition to the current pacemaker detection, tools for a broader range of support apparatus are necessary.
Funding for this study: The authors wish to acknowledge support from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking - DRAGON (101005122) (I.S., A.B., M.R., L.E.S., J.B., C.-B.S., E.S., J.W.M., AIX-COVNET); the National Institute for Health and Care Research (NIHR) Cambridge Biomedical Research Centre (BRC-1215-20014) (I.S., L.E.S., J.H.F.R., E.S., J.W.M.); Wellcome Trust (J.H.F.R.), British Heart Foundation (J.H.F.R.); the EPSRC Cambridge Mathematics of Information in Healthcare Hub EP/T017961/1 (M.R., J.H.F.R., C.-B.S.); Cancer Research UK (CRUK) National Cancer Imaging Translational Accelerator (NCITA) [C42780/A27066] (L.E.S.); Cambridge Mathematics of Information in Healthcare (CMIH) Hub EP/T017961/1; Austrian Science Fund (FWF, project T-1307) (A.B.); and the Trinity Challenge BloodCounts! project (M.R., C.-B.S.). The AIX-COVNET collaboration is also grateful to Intel for financial support.

C.B.S. additionally acknowledges support from the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the EPSRC advanced career fellowship EP/V029428/1, EPSRC grants EP/S026045/1, EP/T003553/1, EP/N014588/1 and EP/T017961/1, the Wellcome Innovator Awards 215733/Z/19/Z and 221633/Z/20/Z, the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 777826 NoMADS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute.

Please note that the content of this publication reflects the authors’ views and that neither IMI nor the European Union, EFPIA, or the DRAGON consortium are responsible for any use that may be made of the information contained therein.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The Brent Research Ethics Committee, the Health Research Authority (HRA), and Health and Care Research Wales (HCRW) provided ethical approval for our retrospective study (IRAS ID: 282705, REC No.: 20/HRA/2504, R&D No.: A095585). Informed consent was not required as data was pseudonymised.