Research Presentation Session
06:00 G. Menezes, San Antonio / US
Our imaging device fuses laser-based optoacoustic imaging with grey-scale ultrasound (OA/US) to differentiate between benign and malignant masses of the breast. The study compared the performance of radiologists with that of a machine learning (ML) classifier.
Methods and materials:
We used a subset of 1,585 masses from the PIONEER trial (USA, December 2012-September 2015) to train the ML classifier. The training set consisted of image feature scores that were assigned by 7 independent breast radiologists (5 ultrasound and 5 OA/US features), in addition to mass size, mass depth, patient age, and the mammogram BI-RADS category. We then tested the classifier using all 213 masses from the MAESTRO OA/US trial (Netherlands, March 2015-February 2016). Sensitivity, specificity, and AUC were calculated for both the radiologists and the classifier predictions.
Results:
The classifier’s sensitivity was 97.0%, versus 95.5% for the radiologists. The classifier also outperformed the radiologists in specificity (55.5% vs 41.1%). AUC was 86.9% for the classifier and 83.1% for the radiologists. Partial AUC (over the sensitivity range 95.0% to 100%) was 73.9% for the classifier versus 61.0% for the radiologists. Because the classifier and the radiologists use the same feature scores, the only difference between their results is how those feature scores are combined into a final likelihood-of-malignancy score.
Conclusion:
The ML classifier exceeds the performance of radiologists on new, external data. This indicates that the classifier might help radiologists improve their final OA/US assessment. Correct assignment of the OA/US features is essential for optimal classifier performance.
Limitations:
The main limitations of our study were the sample size and the inclusion criteria of the MAESTRO trial: only 213 BI-RADS 4A and 4B masses were included.
Ethics committee approval
Funding:
This study was funded by Seno Medical Instruments, Inc.
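As a sketch of how the reported metrics could be computed, the snippet below illustrates AUC and a normalised partial AUC over a sensitivity range with hypothetical labels and scores; the authors' exact implementation is not described in the abstract.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def partial_auc_over_sensitivity(y_true, y_score, tpr_min=0.95):
    """Normalised partial AUC over the sensitivity range [tpr_min, 1]:
    the mean specificity achieved while sensitivity stays in that range."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # keep the lowest FPR attained at each distinct sensitivity level
    tpr_u, first = np.unique(tpr, return_index=True)
    grid = np.linspace(tpr_min, 1.0, 200)
    spec = 1.0 - np.interp(grid, tpr_u, fpr[first])
    return float(spec.mean())

# hypothetical labels (1 = malignant) and likelihood-of-malignancy scores
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.4, 0.5, 0.7, 0.8, 0.9]
auc = roc_auc_score(y_true, y_score)
pauc = partial_auc_over_sensitivity(y_true, y_score)
```

A perfect classifier yields a partial AUC of 1.0; restricting the range to high sensitivity emphasises specificity where a screening test must operate.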
04:42 Ma Jie, Shenzhen / CN
Breast density is routinely assessed visually by radiologists, and high consistency and accuracy of the assessment are necessary. Furthermore, breast density is an important risk factor for breast cancer. However, qualitative assessment of mammographic breast density is subjective and varies considerably between radiologists.
Methods and materials:
We designed two models: a two-category model (scattered density vs heterogeneously dense) and a four-category model (almost entirely fatty, scattered areas of fibroglandular tissue, heterogeneously dense, and extremely dense). The novelty of the presented method is its use of pyramidal residual units, which increase the feature-map dimension gradually rather than only at down-sampling layers. Data augmentation involved several real-time transforms of the source images during training. Additionally, image preprocessing, extraction of the breast region of interest, and image normalisation by adjustment of window width and window centre were adopted to keep the training set from different vendors consistent and uniform. In this retrospective study, the method was trained and validated using 40,364 and 9,000 mammograms, respectively, from our hospital.
Results:
The presented CNN-based breast density classification models were tested directly on 1,821 images. The two-category model achieved an accuracy of 94.56% and an area under the ROC curve (AUC) of 96.8%; the four-category model achieved an accuracy of 81.88% and an AUC of 94.40%.
Conclusion:
Our experimental results demonstrated high classification accuracy between two hard-to-distinguish breast density categories, which is helpful for addressing inconsistency in the density assessment of mammograms. Importantly, we anticipate that the proposed method will improve assessment of breast density and support breast cancer risk assessment, assisting radiologists in providing better notification to patients in breast cancer screening.
Limitations:
Bias caused by racial differences was not well considered.
Ethics committee approval
Funding:
No funding was received for this work.
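The window-centre/window-width normalisation step described above can be sketched as follows; the window parameters here are hypothetical, and the authors' implementation details are not given.

```python
import numpy as np

def window_normalise(img, centre, width):
    """Map raw pixel values into [0, 1] through a window defined by its
    centre and width, clipping values that fall outside the window."""
    lo = centre - width / 2.0
    scaled = (img.astype(np.float32) - lo) / float(width)
    return np.clip(scaled, 0.0, 1.0)

# hypothetical 12-bit mammogram patch normalised with centre=2048, width=4096
patch = np.array([[0, 1024], [2048, 4095]])
norm = window_normalise(patch, centre=2048, width=4096)
```

Applying the same window per vendor maps images from different detectors onto a common intensity scale before training.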
05:55 M. Beuque, Maastricht / NL
Contrast-enhanced mammography (CEM) has superior diagnostic accuracy compared to conventional mammography. Applying machine learning techniques could potentially increase the accuracy of CEM even further, but is hampered by the labour-intensive process of contouring lesions. Automated detection and ranking of lesions seen on CEM would shorten this process and facilitate automated classification as part of a machine learning workflow.
Methods and materials:
Our dataset (n=828) included both craniocaudal (CC) and mediolateral oblique (MLO) views consisting of low-energy and recombined images. All images contained manual delineations of lesions made by expert radiologists. Pre-processing consisted of: (1) cropping the image around the breast, (2) merging 2x low-energy and 1x recombined image as layers into an RGB image, and (3) rebinning from 16 to 8 bits using adaptive histogram equalisation. A RetinaNet model was trained with a ResNet50 backbone. We considered a mass to be ‘detected’ when the intersection over union (IoU) of the detected mass area and the expert delineation was greater than 0.1.
Results:
Considering the two views separately, our model correctly detected 87% of the masses in the test set, with an average IoU of 0.7 for detected masses. On the training set, it detected 84% of masses, also with an average IoU of 0.7. Combining views at the patient level, 90% of all masses (97% of cancerous masses) were detected in the training set, and 88% (91%) in the test set.
Conclusion:
Our automated detection and localisation tool was able to detect the vast majority of masses seen on CEM. The machine learning workflow can be accelerated significantly using this tool, as contouring is normally a time-consuming and labour-intensive process.
Limitations:
This workflow is not applicable to micro-calcifications.
Ethics committee approval
Funding:
European Program H2020 (PREDICT-ITN-766276); KWF Kankerbestrijding 12085/2018-2.
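The IoU-based detection criterion above can be sketched with axis-aligned boxes (x1, y1, x2, y2); this is a simplified box-level stand-in for the area overlap between detection and expert delineation.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_detected(pred_box, truth_box, threshold=0.1):
    """Apply the study's detection rule: IoU above 0.1 counts as detected."""
    return iou(pred_box, truth_box) > threshold
```

The low 0.1 threshold reflects that the goal is flagging a lesion's approximate location for later contouring, not pixel-perfect localisation.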
06:56 K. Lang, Malmö / SE
To analyse the detection performance of an artificial intelligence (AI) system on prior standard digital screening mammograms of interval cancers (ICs).
Methods and materials:
430 IC screening exams acquired with different mammography devices at four screening sites in Southern Sweden (2013–2017) were analysed with a deep learning-based AI system. The system assigns risk scores from 1 to 10 with increasing risk of malignancy. Recall recommendations were also provided by the AI tool at approximately 4% and 1% recall rates (risk score ≥9.67 and ≥9.92, respectively). For the cases with recommended recall, two experienced breast radiologists classified the IC type in consensus (true-negative, minimal sign, or false-negative) and determined whether the AI system correctly localised the lesion.
Results:
One third of the ICs had the highest AI risk score of 10 (33%, 143/430), while 40% were assigned scores 1 to 7 (i.e. within the 70% of the screening population judged by the AI to have a lower likelihood of cancer). At a 4% recall rate, 17% of the ICs (73/430) were included, 74% of which were visible in retrospect (28 minimal sign, 26 false-negative), and all but 7 were correctly localised. At a 1% recall rate, 6% of the ICs (24/430) were included, 79% of which were visible in retrospect (12 minimal sign, 7 false-negative), and all were correctly localised.
Conclusion:
The AI system was able to detect a substantial number of interval cancers on the prior screening exam. Applying an AI-derived recall recommendation for the most suspicious cases, for example as input to a third reader or a consensus discussion, might help radiologists reduce the interval cancer rate.
Limitations:
A single AI vendor was used, and the review of IC cases was retrospective.
Ethics committee approval
Ethics committee approval and informed consent were obtained.
Funding:
No funding was received for this work.
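The mapping from a target recall rate to a risk-score cut-off (as in the ≥9.67 and ≥9.92 thresholds above) can be sketched as an upper population quantile; the scores here are hypothetical, and the vendor's actual calibration is not described.

```python
import numpy as np

def recall_threshold(population_scores, recall_rate):
    """Risk-score cut-off that would flag approximately the top
    `recall_rate` fraction of a screening population for recall."""
    return float(np.quantile(population_scores, 1.0 - recall_rate))

# hypothetical risk scores for a screening population on a 1-10 scale
rng = np.random.default_rng(0)
scores = rng.uniform(1.0, 10.0, size=10_000)
t4 = recall_threshold(scores, 0.04)  # ~4% recall rate
t1 = recall_threshold(scores, 0.01)  # ~1% recall rate
```

A stricter (1%) recall rate necessarily yields a higher cut-off than the 4% rate on the same score distribution.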
05:22 E. Conant, Philadelphia / US
To evaluate the performance of artificial intelligence (AI) with digital breast tomosynthesis (DBT) by craniocaudal (CC) versus mediolateral oblique (MLO) mammographic views.
Methods and materials:
A diagnostic accuracy study of an AI system was conducted with 260 retrospectively collected DBT exams (65 biopsy-proven cancers and 195 non-cancers) from sequential blocks of cases from seven U.S. sites, randomly selected to match a screening population. The maximum AI scores, overall and across breasts within each mammographic view (CC versus MLO), were compared to a previously validated threshold for cancer detection. 95% confidence intervals (CIs) for the difference in correlated proportions for sensitivity and specificity were calculated using a McNemar test.
Results:
The estimated sensitivity (95% CI) for CC and MLO was 0.74 (0.62, 0.83) and 0.77 (0.65, 0.85), respectively; a difference of -0.03 (-0.30, 0.24). The estimated specificity with CC images was 0.64 (0.57, 0.70), significantly higher than that of MLO images at 0.54 (0.47, 0.61), giving a difference of 0.10 (0.01, 0.19). Among the 65 cancer cases, 9 were detected by CC alone, 11 by MLO alone, and 39 by both. Among the 195 non-cancer cases, 45 were correctly ruled out by CC alone, 26 by MLO alone, and 79 by both. Overall, the AI system detected 91% (59/65) of cancers and ruled out 41% (79/195) of non-cancer cases.
Conclusion:
While the AI system was not trained to maximise performance within a mammographic view, these results suggest that unbalanced conspicuity across mammographic views may have implications for AI performance with single-view exams.
Limitations:
The performance of AI alone may not be indicative of radiologist performance in clinical practice.
Ethics committee approval
IRB approval and a waiver of informed consent were obtained for HIPAA-compliant retrospective case collection.
Funding:
iCAD (Nashua, NH) funded this work.
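The paired CC-versus-MLO comparison above can be sketched from the discordant counts alone (e.g. 9 cancers detected by CC only and 11 by MLO only out of 65). The McNemar statistic below is the standard uncorrected form, and the interval is a simple Wald CI for correlated proportions, which need not match the authors' exact method.

```python
from math import sqrt

def mcnemar_chi2(b, c):
    """McNemar chi-square statistic from the two discordant counts."""
    return (b - c) ** 2 / (b + c)

def paired_diff_ci(b, c, n, z=1.96):
    """Wald CI for the difference in correlated proportions, (b - c) / n."""
    d = (b - c) / n
    se = sqrt((b + c) - (b - c) ** 2 / n) / n
    return d - z * se, d + z * se

stat = mcnemar_chi2(9, 11)          # CC-only vs MLO-only detections
lo, hi = paired_diff_ci(9, 11, 65)  # sensitivity difference, CC - MLO
```

Only the discordant pairs drive the test: cases detected (or missed) by both views carry no information about which view is better.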
05:34 M. Salim, Stockholm / SE
To examine whether there was any difference in cancer detection between various state-of-the-art AI CAD algorithms applied as independent readers of screening mammography, and whether they reached radiologist-level performance.
Methods and materials:
This case-control study, nested within a population-based screening cohort during 2008 to 2015, consisted of the latest screening examination for 740 women diagnosed with breast cancer (positive) and a random sample of 8,066 healthy controls (negative). Positive ground truth was determined by pathology-verified diagnosis at screening or within 12 months thereafter. Negative ground truth was determined by 2-year cancer-free follow-up. There were 25 original first-reader radiologists, one for each examination. Three AI CAD algorithms, sourced from different vendors, yielded a continuous prediction score for the suspicion of cancer for each examination. For a binary decision, the cut-point was defined by the mean specificity of the original first-reader radiologists (96.56%). The processing of one of the three AI algorithms has not yet been completed, and those results cannot currently be reported.
Results:
The average age was 58.2 years for positive cases and 55.1 years for negative cases. The AUC was 0.96 (95% CI: 0.95 to 0.97) for AI algorithm 1 and 0.91 (95% CI: 0.90 to 0.92) for AI algorithm 2 (p<0.001). At the radiologists' specificity, sensitivity was 82%, 66%, and 76% for AI algorithm 1, AI algorithm 2, and the original first-reader radiologists, respectively (p<0.001 for each pair-wise comparison).
Conclusion:
One AI algorithm outperformed the other AI algorithm and the original first-reader radiologists. The time has come for prospective screening trials using carefully chosen AI CAD algorithms under controlled circumstances.
Limitations:
n/a
Ethics committee approval
Our ethical review board approved the research in this study and waived the need for individual informed consent.
Funding:
Stockholm County Council Dnr 20170802.
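Fixing the binary cut-point at the radiologists' mean specificity, as described above, can be sketched as a quantile of the healthy-control score distribution; the scores below are hypothetical, and each vendor's score scale differs in practice.

```python
import numpy as np

def cutpoint_at_specificity(negative_scores, specificity):
    """Score cut-point below which the given fraction of healthy
    (negative) examinations falls."""
    return float(np.quantile(negative_scores, specificity))

def sensitivity_at(positive_scores, cut):
    """Fraction of cancer (positive) examinations at or above the cut."""
    return float(np.mean(np.asarray(positive_scores) >= cut))

neg = np.linspace(0.0, 1.0, 101)  # hypothetical healthy-control scores
cut = cutpoint_at_specificity(neg, 0.9656)
```

Matching specificity across algorithms lets their sensitivities be compared head to head on a common operating point, as in the study's 82% vs 66% vs 76% comparison.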
04:45 N. Sharma, Leeds / UK
To evaluate the level of agreement between an automated mammography image positioning assessment tool and a panel of experts, to determine whether such a tool may be practically integrated into routine breast screening activities.
Methods and materials:
672 FFDM studies rejected due to positioning errors were independently reviewed by 5 radiographers and 2 radiologists. Reviewers evaluated the studies for positioning errors, including (1) portion cut off, (2) inadequate inframammary fold (IMF), (3) pectoralis muscle position, and (4) pectoralis muscle thickness. Inter-rater agreement between the consensus of the 7 reviewers and the automated tool was evaluated using weighted Fleiss’ kappa.
Results:
Inter-rater agreement between the algorithm and the panel of experts ranged from good to excellent (kappa = 0.546–0.84).
Conclusion:
An automated mammography image positioning error algorithm demonstrates good to excellent agreement with a consensus of experts and may be effective for continual quality assurance efforts in routine breast screening activities.
Limitations:
A limitation of this analysis is that the expert reviewers were not given a training session or training dataset prior to performing their assessments for this study.
Ethics committee approval
Funding:
There was no funding received for this study. Densitas Inc. provided in-kind support, such as the use of their Image Review Tool.
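Agreement on an ordinal positioning-error grade can be sketched with a weighted kappa. The snippet uses Cohen's linearly weighted kappa between two raters as a simplified stand-in (the study used weighted Fleiss' kappa against a seven-reviewer consensus), with hypothetical grades.

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical ordinal positioning-error grades (0 = none, 1 = minor, 2 = major)
panel_consensus = [0, 1, 2, 1, 0, 2, 1, 1, 2, 0]
automated_tool  = [0, 1, 2, 2, 0, 2, 1, 0, 2, 0]
kappa = cohen_kappa_score(panel_consensus, automated_tool, weights="linear")
```

Weighting penalises a tool that calls a "major" error "none" more heavily than one that calls it "minor", which matters for graded criteria such as pectoralis position.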
05:36 V. Romeo, Naples / IT
To assess the applicability of a machine learning (ML) approach using texture analysis (TA) features extracted from ultrasound (US) images acquired in two different institutions to differentiate benign from malignant breast lesions.
Methods and materials:
Ultrasound examinations of 117 patients with 135 breast lesions (benign n=91, malignant n=44) from institution 1 and 55 patients with 57 breast lesions (benign n=19, malignant n=38) from institution 2 were retrospectively selected. Fine needle aspiration and/or tru-cut biopsy or follow-up served as the standard of reference. The breast lesions from institution 1 were used as the training set, while those from institution 2 were used as the validation set. After grey-scale image normalisation and data discretisation using a bin width of 3, breast lesions were manually segmented on 2D US images by delineating regions of interest (ROIs). The ROIs and images were imported into dedicated software to extract first-, second-, and higher-order TA features. Features were then normalised according to the training set population. ML analysis was then conducted to obtain the highest accuracy, expressed as the percentage of correctly classified instances.
Results:
A total of 697 features were extracted. Of these, 579 highly correlated features (r>0.8) and 37 features with a variance <0.01 were excluded, leaving 81 features in the final dataset. Data mining software was then run with a subset evaluator, which identified 5 final features. Employing these features with a random forest algorithm (1,000 iterations, 10-fold cross-validation), an accuracy of 80.7% was obtained on both the training and validation sets.
Conclusion:
A ML approach using US-derived features may represent a promising tool to discriminate benign from malignant breast lesions.
Limitations:
A retrospective study.
Ethics committee approval
Funding:
No funding was received for this work.
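The correlation- and variance-based feature reduction above can be sketched as follows; this is a simplified stand-in for the authors' pipeline, and the greedy drop order is one of several reasonable choices.

```python
import numpy as np
import pandas as pd

def filter_features(df, corr_thresh=0.8, var_thresh=0.01):
    """Drop near-constant features (variance <= var_thresh), then drop one
    feature of every pair whose absolute correlation exceeds corr_thresh."""
    df = df.loc[:, df.var() > var_thresh]
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > corr_thresh).any()]
    return df.drop(columns=redundant)

# toy feature table: b duplicates a, c is constant, d is only weakly correlated
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8],
                   "c": [5, 5, 5, 5], "d": [2, 1, 4, 3]})
kept = filter_features(df)
```

Removing redundant and near-constant features before a wrapper-style subset search (like the study's subset evaluator) keeps the search space small relative to the modest sample size.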
05:48 S. Shalaby, Cairo / EG
We have developed an ensemble model of multiple machine learning and deep learning algorithms to detect lesions, determine density levels, and show the overall probability of malignancy in mammography studies.
Methods and materials:
We used a pre-trained DenseNet-121 model, trained against a labelled dataset of 200K+ high-resolution mammography exams, to detect lesions and display them as an ROI or heatmap. We also used 500K+ labelled exams with structured reports to build a density classification model. We balanced a dataset of 90K studies and used a GoogLeNet architecture to train a model that classifies images into two levels (low and high density), with an AUC of 89%. Another pre-trained patch classifier for estimating the probability of malignancy was run against a balanced dataset of 2.5K studies (10K images). This generated a structured dataset representing the predicted lesion locations and classifications per breast and view (MLO/CC). The dataset was enriched with additional tags such as malignant probability, benign probability, number of malignant/benign lesions, age, density (high/low), and exposure time.
ML models used: ridge classifier, decision tree, random forest, extra trees, voting classifier, and AdaBoost. The voting classifier provided the best results, with an AUC of 88%.
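A hard-voting ensemble over the listed model types can be sketched as follows, on synthetic data via `make_classification` as a stand-in for the enriched per-breast feature table; hard voting is used because a ridge classifier has no probability output.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the enriched per-breast feature table
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("ridge", RidgeClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("extra", ExtraTreesClassifier(random_state=0)),
        ("ada", AdaBoostClassifier(random_state=0)),
    ],
    voting="hard",  # majority vote over the five models' class predictions
)
ensemble.fit(X_tr, y_tr)
accuracy = ensemble.score(X_te, y_te)
```

Majority voting over heterogeneous models tends to cancel individual models' uncorrelated errors, which is consistent with the reported gain of the voting classifier over its components.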
Results:
The ensemble model was tested with experienced radiologists on a subset of 100 studies as phase one, and it classified malignancy and density with an accuracy of 87%.
Conclusion:
The ensemble model improves the accuracy of the individual models by 6%. A built-in chatbot embedded in the PACS viewer helped radiologists verify the results much more easily.
Limitations:
The model will be tested with more datasets to increase detection accuracy.
Ethics committee approval