Research Presentation Session: Artificial Intelligence & Machine Learning & Imaging Informatics

RPS 1205 - Recent development in AI for lung nodule detection

March 1, 08:00 - 09:00 CET

7 min
External validation of the Sybil risk model as a tool to identify low-risk individuals eligible for biennial lung cancer screening
Fennie Van der Graaf, Nijmegen / Netherlands
Author Block: F. Van der Graaf1, N. Antonissen1, Z. Saghir2, C. Jacobs1, M. Prokop1; 1Nijmegen/NL, 2Hellerup/DK
Purpose: Follow-up intervals in lung cancer screening protocols should minimise harm, maximise cost-effectiveness, and avoid diagnostic delays. ILST suggests biennial follow-up for low-risk participants. This study aimed to retrospectively evaluate Sybil, a deep learning algorithm that predicts 6-year lung cancer risk from a single LDCT, and to compare it with PanCan2b for identifying participants eligible for biennial screening.
Methods or Background: DLCST baseline scans included 1870 non-cancer cases and 25 screen-detected cancer cases diagnosed within 2 years. Sybil (per scan) and PanCan2b (per nodule) predicted the risk of developing cancer within 2 years. For participants with no screen-annotated nodules, the PanCan2b risk score was set to 0%. For both models, a risk cut-off of <1.5% was used to identify low-risk participants for biennial follow-up, based on ILST. For PanCan2b, the risk-dominant nodule per scan was considered.
Results or Findings: The Sybil and PanCan2b models identified 1616 and 1697 individuals, respectively, who met the criteria for biennial screening. This would reduce the number of CT scans in the second screening round by 87% and 94%, respectively. The group referred for biennial screening included 8 cancers for Sybil and 9 for PanCan2b.
Conclusion: Both Sybil and PanCan2b selected a large group of low-risk participants for biennial screening when a <1.5% risk threshold was applied at baseline CT. The difference between Sybil and PanCan2b was small. More research is needed on the types of cancer subject to delayed diagnosis and on whether such delay leads to a diagnostic stage shift. In addition, further external validation of Sybil on other datasets is needed to assess its applicability in lung cancer screening and to evaluate its performance on follow-up imaging.
Limitations: This study is a baseline, retrospective analysis on data from one screening trial.
Funding for this study: Funding was provided by a research grant funded by the Dutch Science Foundation and Mevis Medical Solutions, Bremen, Germany.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study included data collected from the Danish Lung Cancer Screening Trial (DLCST). For DLCST, the Ethics Committee of Copenhagen county approved the study, and informed consent was obtained from all participants.
7 min
Artificial intelligence: the key to significant reduction in baseline LDCT lung cancer screening associated workload when used as a first-read filter
Harriet Louise Lancaster, Groningen / Netherlands
Author Block: H. L. Lancaster1, B. Jiang1, M. Silva2, J. W. Gratema3, D. Han1, J. Field4, G. De Bock1, M. A. Heuvelmans1, M. Oudkerk1; 1Groningen/NL, 2Parma/IT, 3Apeldoorn/NL, 4Liverpool/UK
Purpose: Artificial intelligence (AI) is not a new concept in the field of lung cancer screening. To date, AI has predominantly been used to predict lung nodule malignancy risk (rule-in principle). However, to have an impact on radiologist workload, a new rule-out approach is needed. This study aimed to evaluate whether AI can be used as a first-read filter to rule out negative cases (nodules <100 mm3), so that radiologists would only need to assess indeterminate-positive nodules, thereby significantly reducing workload.
Methods or Background: External validation of AI (AVIEW_LCS, v1.1.39.14) was performed in a UKLS dataset containing 1254 LDCT-baseline scans. Scans were assessed independently by four manual readers and AI. Discrepancies between reads (manual/AI) were reviewed by a consensus panel of two experienced thoracic radiologists, blinded to the original results. Final classification was based on the consensus reference read. Cases were ultimately classified as correct, positive misclassifications (PMs; nodules classified by the reader/AI as ≥100 mm3 but <100 mm3 at consensus), or negative misclassifications (NMs; nodules classified by the reader/AI as <100 mm3 but ≥100 mm3 at consensus).
Results or Findings: Based on the consensus reference read, 816 (65%) cases were negative and 438 (35%) indeterminate-positive. AI had fewer NMs (68; 5%) than all manual readers [reader 1: 205 (16%); reader 2: 200 (16%); reader 3: 236 (19%); reader 4: 220 (18%)], which was reflected in an AI negative predictive value (NPV) of 91.7% (89.8-91.4%) [reader 1: 79.0% (77.5-82.0%); reader 2: 80.1% (81.3-85.4%); reader 3: 77.5% (76.0-79.0%); reader 4: 78.5% (77.0-80.0%)]. Workload reduction using AI was calculated at 65% [(1254 total scans − (370 correct positives + 59 PMs)) / 1254 total scans].
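The workload-reduction arithmetic above can be reproduced in a few lines (a sketch using the counts reported in the abstract; under a first-read filter, radiologists read only the scans the AI flags as positive, whether correctly or not):

```python
# Counts reported in the abstract.
total_scans = 1254
correct_positives = 370
positive_misclassifications = 59

# Scans a radiologist would still need to read under the first-read filter.
flagged = correct_positives + positive_misclassifications

# Workload reduction = fraction of scans the radiologist no longer reads.
reduction = (total_scans - flagged) / total_scans
print(f"Workload reduction: {reduction:.1%}")  # prints "Workload reduction: 65.8%"
```

The exact value is 825/1254 ≈ 65.8%, which the abstract rounds to 65%.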
Conclusion: AI negative predictive performance is better than all manual readers. If used as a first-read filter, radiologists would only need to assess 35% of cases with indeterminate-positive nodules, meaning significant workload reduction.
Limitations: An identified limitation of this study is that true positives and negatives based on histological outcome were not reported; these analyses will begin shortly.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: The UKLS study was approved by an ethics committee; this substudy was covered by previous approval as it only included de-identified data.
7 min
Can an AI-driven decision aid reduce the time between chest x-rays and treatment for lung cancer patients?
Lorna Cameron, Aberdeen / United Kingdom
Author Block: A. Keen, S. Wilkie, B. E. Morrissey, S. Prior, L. Cameron; Aberdeen/UK
Purpose: The primary aim of this study was to evaluate whether an AI product with the ability to identify chest x-ray (CXR) images of highest risk of lung cancer can reduce the time between imaging and treatment in those patients subsequently diagnosed with lung cancer.
Methods or Background: The NHS Grampian Innovation, Radiology and Cancer Teams collaborated with the Centre for Sustainable Delivery and the Scottish Health Technology Group to design an evaluation of the real-world impact of using an AI product designed to risk stratify CXR images. Full pathway mapping was carried out and baseline time delays between all key points (CXR, CXR reporting, CT, CT reporting, MDT diagnosis and treatment) were established. CXR images flagged as highest risk of lung cancer were expedited for CXR reporting, CT and CT reporting. NHS Grampian radiologists collaborated with the company to calibrate the product in ways that maximised identification of lung cancer whilst not overwhelming CT capacity.
Results or Findings: Several months into the project, the time between CXR and CT report has dropped from 22 to 10.3 days (N=132). Radiologists identified 28 images, not flagged by the product, that raised concern for cancer. Thus far, none of these patients have been diagnosed with cancer. Under the current calibration conditions, using radiologists' judgements as the reference, the product performs at 84.4% sensitivity and 90.5% specificity (N=24071).
Conclusion: Early results suggest AI risk stratification of CXR images may help healthcare organisations reduce the time taken to treat people diagnosed with lung cancer. This could be especially important for people who are diagnosed following CXR imaging for non-cancer reasons. In our region, this is about two thirds of people diagnosed with lung cancer.
Limitations: A limitation of this study is that these are early results from a 12-month evaluation.
Funding for this study: Funding was received from the Scottish Government.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: We were advised by our Research and Development Office that permissions were not necessary on this occasion. Local and national clinical governance and executive-level sponsorship are in place.
7 min
Comparison of the impact of reconstruction kernels on pulmonary nodule volumetry in low-dose CT with iterative vs deep learning image reconstruction
Louise D'hondt, Ghent / Belgium
Author Block: L. D'hondt1, C. Franck2, P-J. Kellens1, F. Zanca3, D. Buytaert4, K. Bacher1, A. Snoeckx2; 1Ghent/BE, 2Antwerp/BE, 3Leuven/BE, 4Aalst/BE
Purpose: The objective of the study was to investigate the impact of different reconstruction kernels, and their interaction with other imaging parameters, on nodule volumetry, since scan protocols, screening guidelines, and vendor specifications typically define the soft kernel as the standard, thereby disregarding the kernel's potential influence.
Methods or Background: We scanned the Lungman phantom containing 3D-printed lung nodules, encompassing six diameters (4 to 9 mm) and three morphology classes (lobular, spiculated, smooth), using a 256-slice CT scanner at various radiation doses (CTDIvol 6.04, 3.03, 1.54, 0.77, 0.41, 0.20 mGy) and reconstructed using different combinations of either soft or hard reconstruction kernels and iterative reconstruction (IR) or deep learning image reconstruction (DLIR) at varying strengths. The impact of these imaging parameters on semi-automatic volumetry measurements was analysed through multiple linear regression.
Results or Findings: We found that the reconstruction kernel significantly impacts volumetric accuracy, both as a primary factor and in interaction with the reconstruction algorithm and radiation dose. Overall, volumetric errors were lower with the soft kernel than with the hard (lung) kernel. Additionally, errors with the soft kernel decreased with increasing radiation dose, while errors with the lung kernel remained relatively constant across all doses. Combining the lung kernel with DLIR reduced the volumetric error by up to 50% relative to IR, at all doses, and this effect became more pronounced as the DLIR strength increased. Across all nodule morphologies and diameters using the lung kernel, DLIR consistently outperformed IR, with relative error reductions between 20 and 90%.
Conclusion: Compared to other combinations of reconstruction algorithms and kernels, application of DLIR in combination with a hard kernel overall returns the highest volumetric accuracy for all pulmonary nodules, also at (ultra)low radiation doses.
Limitations: An identified limitation is that this is a phantom study.
Funding for this study: Funding was provided by the FWO “Kom op tegen Kanker” project for lung cancer screening research in Belgium (Project number: G0B1922N).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: No ethics committee approval was needed, since this study used phantom images.
7 min
Improving image quality of sparse-view lung cancer CT images using convolutional neural networks
Tina Dorosti, Neuried / Germany
Author Block: T. Dorosti1, A. Ries2, J. B. Thalhammer1, A. Sauter1, F. Meurer1, T. Lasser2, F. Pfeiffer1, F. Schaff2, D. Pfeiffer1; 1Munich/DE, 2Garching/DE
Purpose: This study aimed to improve the image quality of sparse-view computed tomography (CT) images with a U-Net for lung cancer detection and to determine the best trade-off between number of views, image quality, and diagnostic confidence.
Methods or Background: CT images from 41 subjects (34 with lung cancer, seven healthy) were retrospectively selected (01.2016-12.2018) and forward projected onto 2048-view sinograms. Six corresponding sparse-view CT data subsets at varying levels of undersampling were reconstructed from sinograms using filtered back projection with 16, 32, 64, 128, 256, and 512 views, respectively. A dual-frame U-Net was trained and evaluated for each subsampling level on 8,658 images from 22 diseased subjects. A representative image per scan was selected from 19 subjects (12 diseased, seven healthy) for a single-blinded reader study. The selected slices, for all levels of subsampling, with and without post-processing by the U-Net model, were presented to three readers. Image quality and diagnostic confidence were ranked using pre-defined scales. Subjective nodule segmentation was evaluated utilising sensitivity (Se) and Dice Similarity Coefficient (DSC) with 95% confidence intervals (CI).
Results or Findings: The 64-projection sparse-view images resulted in Se=0.89 and DSC=0.81 [0.75, 0.86], while their counterparts post-processed with the U-Net had improved metrics (Se=0.94, DSC=0.85 [0.82, 0.87]). Fewer views led to insufficient quality for diagnostic purposes. At higher numbers of views, no substantial discrepancies were noted between the sparse-view and post-processed images.
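For reference, the two segmentation metrics reported above can be computed from binary masks as follows (a generic sketch, not the authors' code; the mask arguments are illustrative flat 0/1 sequences):

```python
def sensitivity_and_dice(pred, truth):
    """Voxel-wise sensitivity (Se) and Dice similarity coefficient (DSC)
    for two binary segmentation masks of equal length."""
    # True positives: voxels marked foreground in both masks.
    tp = sum(p and t for p, t in zip(pred, truth))
    se = tp / sum(truth)                     # fraction of true voxels found
    dsc = 2 * tp / (sum(pred) + sum(truth))  # overlap measure in [0, 1]
    return se, dsc

print(sensitivity_and_dice([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5)
```

Identical masks yield Se=1.0 and DSC=1.0; DSC penalises both missed voxels and spurious ones, which is why it is reported alongside sensitivity.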
Conclusion: Projection views can be reduced from 2048 to 64 while maintaining image quality and radiologists' diagnostic confidence at a satisfactory level.
Limitations: The sparse-view data generated for this study was obtained under simplified conditions not reflective of the complex reconstruction processes in clinical settings. An exact measure of dose reduction is therefore unachievable.
Funding for this study: Funding was received from the Federal Ministry of Education and Research (BMBF) and the Free State of Bavaria under the Excellence Strategy of the Federal Government and the Länder, the German Research Foundation (GRK2274), as well as by the Technical University of Munich - Institute for Advanced Study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the ethical review committee and was conducted in accordance with the regulations of our institution. All data was analysed retrospectively and anonymously.
7 min
Head-to-head validation of AI software for the detection of lung nodules in chest radiographs: Project AIR
Steven S Schalekamp, Nijmegen / Netherlands
Author Block: K. G. van Leeuwen1, S. S. Schalekamp1, M. J. Rutten1, M. Huisman1, C. M. Schaefer-Prokop2, M. De Rooij1, B. Van Ginneken1; 1Nijmegen/NL, 2Amersfoort/NL
Purpose: Multiple commercial artificial intelligence (AI) products exist for the detection of lung nodules on chest radiographs, however, comparative performance data of the algorithms is limited. The purpose of the study was to perform independent stand-alone comparison of commercially available AI products for lung nodule detection on chest radiographs, benchmarked against human readers.
Methods or Background: This retrospective, multicentre (n=7 Dutch hospitals) study was carried out as part of Project AIR, which is a Dutch initiative for independent, repeatable, multicentre validation of AI products in radiology. Seven out of 14 eligible AI products for the detection of lung nodules on chest radiographs were validated on a dataset of 386 chest radiographs. The reference was chest CT within 3 months of the chest radiograph. Performance was measured using area under the receiver operating characteristic curve (AUROC). Random subsets of chest radiographs (n=140) were read by 17 human readers, with varying levels of experience.
Results or Findings: Seven lung nodule detection products were validated on chest radiographs (January 2012 to May 2022) of 386 patients (mean age, 64 years ± 11 [SD]; 223 males). Compared to human readers (mean AUROC, 0.81 [95% CI: 0.77, 0.85]), four products performed better (AUROC range, 0.86-0.93 [95% CI: 0.82, 0.96]; P range, <.001-.04). No significant difference was found between the remaining three products and human readers (AUROC 0.79 [0.74,0.84] P=.33, 0.80 [0.75, 0.85] P=.60, 0.84 [0.80, 0.88] P=.26).
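For context, the AUROC used above to benchmark products against readers can be computed directly from scores and labels via the rank-sum (Mann-Whitney) formulation (a generic sketch, unrelated to the study's code; ties count as half):

```python
def auroc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive case receives a higher score than a randomly chosen negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise "wins" of positives over negatives; ties count as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0
```

A value of 0.5 corresponds to chance-level discrimination, 1.0 to perfect separation of nodule-positive from nodule-negative radiographs.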
Conclusion: Compared to human readers, four AI products for detecting lung nodules on chest radiographs showed superior performance whereas three other products showed no evidence of difference in performance for the detection of lung nodules.
Limitations: The added value of these AI products in clinical practice has yet to be determined.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: The analysis was on anonymised retrospective data.
7 min
Autonomous algorithmic monitoring of a deep-learning chest radiograph AI model using temporal divergences, in a clinical real-world setting
Charlene JY Liew, Singapore / Singapore
Author Block: J. Y. C. J. Liew1, A. Gupta2, A. M. Surve2, V. K. Venugopal2; 1Singapore/SG, 2New Delhi/IN
Purpose: This study aimed to evaluate autonomous monitoring of AI models in a clinical environment by measuring temporal divergence of mathematical probability distributions.
Methods or Background: Daily prediction scores and the overall abnormality prediction score of a chest radiograph classification solution (Lunit Insight CXR) were used. A total of 11,572 chest radiograph studies were analysed continuously over 57 days on an AI platform solution in a real-world clinical setting. Of these, 7,005 studies were classified as abnormal by the AI model. The probability distributions of the abnormal predictions and the probability scores for 10 abnormal diagnoses were plotted daily. Jensen-Shannon divergence was used to measure the similarity between each day's probability distribution and the previous day's in a rolling fashion. Daily divergence between the probability distributions and a fixed reference distribution was also measured. A threshold of 0.2 was used for acceptable divergence. Studies from days on which the threshold was breached were reviewed for potential errors or misclassification.
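The day-over-day comparison described above can be sketched with a plain-Python Jensen-Shannon divergence (an illustration, not the authors' implementation; with base-2 logarithms the value is bounded in [0, 1], consistent with a 0.2 alert threshold):

```python
import math

def _normalise(hist):
    """Turn a histogram of prediction-score counts into a distribution."""
    total = sum(hist)
    return [x / total for x in hist]

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    p, q = _normalise(p), _normalise(q)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(x * math.log2(x / (y + eps)) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions diverge by ~0; disjoint ones reach the maximum of 1.
print(round(js_divergence([1, 0], [0, 1]), 3))  # 1.0
```

In the monitoring loop, today's histogram of abnormality scores would be compared against yesterday's (and against a fixed reference), raising an alert whenever the divergence exceeds 0.2. Note that SciPy's `jensenshannon` returns the square root of this quantity, so thresholds are not interchangeable between the two conventions.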
Results or Findings: On day 55 there was a system technical downtime, resulting in fewer cases being processed. That day's divergences were particularly prominent, with pneumothorax recording 0.723. Excluding day 55, divergences ranged from 0.009 to 0.329 across findings. Divergence values for such findings were recalibrated against the moving averages of the previous three days.
Conclusion: We introduced an innovative algorithmic system to monitor deep learning AI solution performance using divergence scores. Divergences were detected on days where there were technical downtimes in the AI system. This emphasises the importance of continuous monitoring of AI in clinical applications, to detect various failures of AI models, which may be due to catastrophic algorithmic failure, data bias, model drift or population data drift.
Limitations: No limitations were identified.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study was approved by SingHealth Central Institutional Review Board: 2023/2280.
7 min
A deep learning module for automated detection and reporting of clinically significant lung nodules on low-dose chest CT scans
Veljko Popov, Wenham / United States
Author Block: V. Popov1, J. Afnan1, U. Kalabic2, Z. Li3, D. Chen3, D. Hassan2, D. Radisic2; 1Burlington, MA/US, 2Wenham, MA/US, 3East Lansing, MI/US
Purpose: Lung cancer remains the leading cause of cancer death worldwide. Multicentre trials (NLST, NELSON) have proven the efficacy of lung cancer screening in high-risk patients using low-dose, non-contrast chest CT scans. A novel artificial intelligence (AI) module for automated nodule detection and output to the structured report is proposed to assist with increasing screening rates while maintaining high levels of diagnostic accuracy.
Methods or Background: The nnDetection framework was applied to train a one-stage detector to segment nodules. Predictions from the nodule detector were fed through an efficient mechanism for reducing overlapping bounding boxes and a separate 3D deep convolutional neural network was trained for false positive reduction (FPR).
The models were trained on the LUNA16 database (800+ LDCT studies) and tested on a holdout subset of LUNA16 (89 studies) and on the Cornell ELCAP database (40 studies), for nodules 6 mm or greater.
Results or Findings: LUNA16 dataset: the nnDetection framework achieved a recall of 100%, a precision of 77%, and a false negative rate of 0%. With the FPR model added, recall remained at 100%, precision increased to 84%, and the false negative rate remained 0%.
ELCAP dataset: for nodules 6 mm or larger, nnDetection with FPR achieved a recall of 100%, a precision of 58%, and a false negative rate of 4%.
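The three detection metrics quoted above relate to per-nodule counts as follows (a generic sketch with illustrative counts, not the study's data; note that the false negative rate is simply 1 − recall):

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Per-nodule detection metrics from true positive, false positive,
    and false negative counts."""
    recall = tp / (tp + fn)     # fraction of real nodules detected
    precision = tp / (tp + fp)  # fraction of detections that are real
    fnr = fn / (tp + fn)        # fraction of real nodules missed (1 - recall)
    return recall, precision, fnr

# Illustrative counts: 84 nodules found, 16 spurious detections, none missed.
recall, precision, fnr = detection_metrics(tp=84, fp=16, fn=0)
print(recall, precision, fnr)  # 1.0 0.84 0.0
```

A false-positive-reduction stage like the FPR model lowers `fp`, which raises precision without touching recall, matching the LUNA16 results above.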
Conclusion: nnDetection + FPR performs very well in detecting clinically relevant nodules on the LUNA16 dataset. In addition, the model shows the ability to scale across LDCT datasets without fine tuning when applied to the ELCAP Cornell dataset, detecting all nodules 6 mm or greater.
Limitations: An identified limitation was the small datasets.
Funding for this study: Private funding was obtained for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Ethics committee approval was not needed as the study used publicly available datasets.

This session will not be streamed, nor will it be available on-demand!