Research Presentation Session
05:43 S. Pacilè, Valbonne / FR
Purpose:
To demonstrate and estimate the benefits that an artificial intelligence (AI) tool could bring to the breast cancer detection performance of radiologists.
Methods and materials: The study was a multi-reader, multi-case investigation with a fully crossed design, so that each case was read by each reader both with and without the aid of AI. Fourteen participants read a set of digital mammography (DM) images: in the first session, half of the cases were read without AI and the other half with AI, and in the second session the complementary assignment was read. The dataset included 240 cases (80 true positives, 40 false negatives, 80 true negatives, and 40 false positives).
The AI tool used (MammoScreen v1.0.0, Therapixel) is designed to identify regions suspicious for breast cancer on DM and to assess their likelihood of malignancy. The endpoints were area under the ROC curve (AUC) and sensitivity.
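As an illustration of the AUC endpoint, the sketch below computes an empirical (Mann-Whitney) AUC for one reader with and without AI. It is a minimal sketch, not the study's statistical analysis: the function name and the suspicion scores are hypothetical, and the multi-reader multi-case significance testing is not reproduced.

```python
def empirical_auc(scores_cancer, scores_normal):
    """Mann-Whitney estimate of the area under the ROC curve.

    Counts the fraction of (cancer, normal) pairs in which the cancer case
    received the higher suspicion score; ties count as half.
    """
    wins = 0.0
    for c in scores_cancer:
        for n in scores_normal:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(scores_cancer) * len(scores_normal))


# Hypothetical suspicion scores (0-100) from one reader; not study data.
normal_scores = [20, 45, 10, 35]
unaided = empirical_auc([70, 40, 85, 30], normal_scores)
aided = empirical_auc([75, 55, 85, 30], normal_scores)
print(f"unaided={unaided:.4f} aided={aided:.4f} difference={aided - unaided:.4f}")
```

Averaging such per-reader differences gives a figure comparable in kind to the reported average AUC difference.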
Results: Among readers, 11 (79%) increased their AUC using the AI system. The average AUC across readers was 0.769 when reading unaided and 0.797 when using the AI system. The average difference in AUC was 0.028 (95% CI: 0.002, 0.055; p = 0.035). Likewise, average sensitivity increased by 0.033 with AI support (p = 0.021). Figures 1 and 2 provide case examples in which 9 radiologists detected small invasive cancers only when using AI, thereby avoiding a false negative.
Conclusion: The performance of radiologists in reading mammograms improved with the concurrent use of this new AI-based tool.
Limitations: n/a
Ethics committee approval: The investigation protocol was approved by an IRB.
Funding: The study was sponsored by Therapixel.
05:51 H. Kim, Seoul / KR
Purpose:
To assess the feasibility of artificial intelligence (AI)-based diagnostic-support software and whether it can be used to improve a radiologist’s diagnostic performance in breast cancer screening under European double reading guidelines.
Methods and materials: A total of 320 screening mammography exams were retrospectively collected from two institutions; each institution contributed 80 cancer exams (proven by biopsy), 32 benign exams (proven by biopsy or follow-up imaging), and 48 normal exams. A multi-reader multi-case study was conducted with 7 breast radiologists. Each radiologist read each case without and then with the assistance of Lunit INSIGHT MMG, an AI-based diagnostic-support software. Readers decided whether each case needed to be recalled in their first reading and then modified their decision by referring to the output of the software. Radiologists' performance was evaluated in terms of sensitivity and specificity as follows: 1) single reading (software-unaided), 2) double reading (majority voting of 3 readers, i.e. the average over the 35 possible combinations of 3 out of 7 readers), and 3) single reading (software-aided). Software standalone performance was also measured.
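The double-reading simulation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, data layout, and toy recall decisions are hypothetical, and only the averaging over all 3-of-7 reader panels is taken from the abstract.

```python
from itertools import combinations


def majority_vote_performance(recalls, truth, panel_size=3):
    """Average sensitivity and specificity over all reader panels.

    recalls: one list per reader, with 1 (recall) or 0 (no recall) per case.
    truth:   1 for cancer, 0 for non-cancer, one entry per case.
    """
    sens, spec = [], []
    for panel in combinations(range(len(recalls)), panel_size):
        tp = fn = tn = fp = 0
        for case, label in enumerate(truth):
            votes = sum(recalls[r][case] for r in panel)
            recalled = votes * 2 > panel_size  # strict majority of the panel
            if label and recalled:
                tp += 1
            elif label:
                fn += 1
            elif recalled:
                fp += 1
            else:
                tn += 1
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return len(sens), sum(sens) / len(sens), sum(spec) / len(spec)


# Toy decisions for 7 hypothetical readers on 4 cases; not study data.
truth = [1, 1, 0, 0]
readers = [[1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 0, 1]] + [[1, 1, 0, 0]] * 4
n_panels, sensitivity, specificity = majority_vote_performance(readers, truth)
print(n_panels, round(sensitivity, 3), round(specificity, 3))
```

With 7 readers and panels of 3, the loop visits 35 panels, matching the number of combinations stated in the methods.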
Results: Average sensitivity and specificity of radiologists based on their binary decision (i.e. recall or not) were: 1) single reading (software-unaided), 80.0% and 72.3%; 2) majority voting of 3 radiologists (double-reading simulation), 81.9% and 75.4%; 3) single reading (software-aided), 86.3% and 73.8%. Standalone performance of the diagnostic-support software was 88.8% (sensitivity) and 81.9% (specificity).
Conclusion: Radiologists' diagnostic performance improved significantly with the assistance of the software: sensitivity rose from 80.0% to 86.3% (P<.001) and specificity from 72.3% to 73.8% (P<.05). Software-assisted sensitivity was higher than that of majority voting among multiple radiologists.
Limitations: Real clinical value needs to be investigated via prospective studies.
Ethics committee approval: Approved by IRB. Informed consent was waived.
Funding: Lunit Inc.
05:59 A. Lauritzen, København / DK
Purpose:
To investigate whether an AI system can detect normal mammograms in a breast cancer screening cohort.
Methods and materials: This retrospective study analysed 18,020 double-read studies from the Danish Capital Region breast cancer screening program, including 143 screen-detected cancers and 447 non-cancer recalls (false positives). Using the deep learning-based image analysis tool Transpara v1.5, all studies were sorted into 10 categories based on findings from four views. A high category (10) indicated a high probability of malignancy, while a low category (1) indicated a very low probability of malignancy. Normal studies were defined as those in category 5 or less. This study examined how many studies and non-cancer recalls could be avoided by identifying normal studies before radiologist reading.
Results: Using category 5 or less as a threshold, 10,545 (58.52%) studies were classified as normal. These included 5 screen-detected cancers (3.5%) and 106 non-cancer recalls (23.71%).
Categories 1 and 2 comprised 4,738 (26.29%) studies, 26 non-cancer recalls (5.82%), and 2 screen-detected cancers (1.36%).
Category 1 comprised 2,627 (14.58%) studies, 12 non-cancer recalls (2.68%), and 0 screen-detected cancers.
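The threshold analysis above amounts to counting what falls at or below a category cut-off. The following is a minimal sketch of that bookkeeping; the function name, outcome labels, and toy records are hypothetical and are not the study data or the vendor's interface.

```python
def triage_summary(studies, threshold):
    """Summarise what a 'normal' cut-off would remove from the reading list.

    studies: iterable of (category, outcome) pairs, where category is 1-10 and
    outcome is 'cancer', 'recall' (non-cancer recall) or 'normal'.
    """
    studies = list(studies)
    below = [outcome for category, outcome in studies if category <= threshold]
    total = {k: sum(1 for _, o in studies if o == k) for k in ("cancer", "recall")}
    return {
        "studies_removed": len(below) / len(studies),
        "cancers_missed": below.count("cancer") / total["cancer"],
        "recalls_avoided": below.count("recall") / total["recall"],
    }


# Toy cohort of 10 hypothetical studies; not study data.
toy = [(1, "normal"), (2, "normal"), (3, "recall"), (5, "normal"), (6, "recall"),
       (8, "cancer"), (4, "cancer"), (10, "cancer"), (7, "normal"), (9, "recall")]
summary = triage_summary(toy, threshold=5)
print(summary)
```

Lowering the threshold trades fewer removed studies for fewer missed cancers, which is the pattern the three reported cut-offs show.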
Conclusion: The results show that the AI system can identify normal mammograms with very few missed screen-detected cancers. Furthermore, a substantial number of false-positive studies were identified as normal. The results suggest that AI systems could effectively and safely reduce the number of studies that radiologists have to examine by a considerable amount, and that several false positives could be avoided.
Limitations: Transpara classified a few screen-detected cancers as normal. Having radiologists review these missed cases could guide future improvements.
The number of cancer cases was limited. Results should be confirmed in a larger study.
Ethics committee approval: n/a
Funding: Partially funded by Eurostars project IBSCREEN ref. 9715.
05:56 S. Heywang-Koebrunner, Munich / DE
Purpose:
To test the capabilities of a new 2D-CAD program based on artificial intelligence and deep learning algorithms for systematic consecutive screen reading.
Methods and materials: A total of 18,002 consecutive screening mammograms acquired in our screening unit between January and November 2018 were anonymised and processed by the CAD system (iCAD Inc.).
A call was considered positive if the case score exceeded a threshold of 30 (a more specific threshold) and a hit was visible within the lesion on at least 1 view. Ground truth for benign lesions was a benign screening result (based on independent double reading), a benign consensus reading of cases considered suspicious by at least one reader, or a benign result after complete workup of cases considered suspicious at consensus reading. Malignancies were proven by percutaneous biopsy and surgery. We counted one diagnosis per patient. Patients with bilateral cancers received 1 diagnosis per breast.
Results: We excluded 45 drop-outs due to incomplete work-up (refused etc.), 40 cases with B3 lesions, and 7 mammographically occult malignancies detected incidentally during assessment or preoperative staging. The evaluation was patient-based; only patients with bilateral cancers were evaluated breast-based. Using a case threshold of 30, CAD achieved a sensitivity of 91.5% (for 32 DCIS and 85 invasive cancers) and a specificity of 80.2%; reader 1: 84.6% and 91.6%; reader 2: 89.7% and 91.5%.
Conclusion: The results justify hopes of using novel CAD systems as a second reader (e.g. in countries with a shortage of readers) or a third reader in the near future. Human consensus reading remains indispensable.
Limitations: n/a
Ethics committee approval: Not necessary since completely anonymised data was used.
Funding: No funding was received for this work.
05:33 S. Hickman, Cambridge / UK
Purpose:
Cancers can be difficult to detect or even missed due to dense breast tissue. This study investigated the relationship between density, characterised using a masking index; lesion conspicuity, evaluated subjectively; and cases found by only one reader in a double reading mammography screening system.
Methods and materials: Using the TOMMY trial dataset, the contralateral breast mammograms of 566 invasive cancer cases were analysed with a masking algorithm based on density and tissue arrangement. Two radiologists independently classified each cancer as either subtle or obvious, assigned a conspicuity score (1-3), and provided their decision to recall the case or not. Lesion size, radiological feature, and density as measured by a visual analogue score (VAS) were recorded. A generalised linear mixed-effects model was used to account for the non-negative skewed nature of the data when associating the masking index with the radiologist assessments. Size, density, and feature were fixed effects. Random effects included intercepts for subjects, the site of test, and lesion side.
Results: There were 58 subtle and 508 obvious invasive cancers (median age 62 years). Median lesion size was 13 mm [IQR 9-19 mm] with median density VAS 31 [IQR 20-50]. The masking index reduced by 25.9% (p<0.0001) from subtle to obvious cases. The reduction was 16.2% (p=0.017) after the adjustment for size, radiological feature, and density VAS. With lower density (VAS <50), the masking index reduced by 28.4% (p=0.0003) from subtle to obvious cases following adjustment. However, this was not demonstrated in cases with higher density (VAS ≥50).
Conclusion: The masking index corresponds with the radiologist’s assessment and could provide a quantitative measure to target supplemental imaging.
Limitations: Small subtle case sample size.
Ethics committee approval: Dataset from ethically approved TOMMY trial.
Funding: CRUK programme grant.
04:31 L. Santiago, Houston / US
Purpose:
To evaluate the acceptability and impact of 3D printed breast models (3DM) on treatment-related decisional conflict (DC) of breast cancer patients.
Methods and materials: Patients with breast cancer were accrued in a prospective single-institution trial. All patients underwent contrast-enhanced breast MRI. A personalised 3D printed breast model (3DM) was derived from the MRI. DC was evaluated before and after 3DM review. Acceptability was assessed after 3DM review.
Results: Pre and post 3DM DC evaluation and 3DM acceptability assessment were completed by 25 patients. Bilateral 3DM was generated in 2 patients with bilateral breast cancer. The mean patient age was 48.8 years (28-72). Tumour stage was Tis (7), 1 (8), 2 (8), and 3 (4). The nodal staging was 0 (19), 1 (7), and 3 (1). Tumours were unifocal (15), multifocal (8), or multicentric (4). Patients underwent mastectomy (13) or segmental mastectomy (14), with (20) or without (7) oncoplastic intervention. Seven patients underwent neoadjuvant therapy. The mean pre and post 3DM DC scores were 16.6 (SD 15.3) and 10.0 (SD 13.2). There was a significant reduction in overall DC post 3DM review, indicating that patients became more assured of their treatment choice (p=0.001). DCS reduction post 3DM was also observed in the uncertainty (p=0.18), informed (p=0.005), values (p=0.41), and effective (p=0.001) DCS subscales. No significant reduction was observed in the support DCS subscale (p=0.148). 3DM acceptability was rated as good/excellent for understanding their condition, disease size, surgical options, encouraging patients to ask questions, 3DM detail, 3DM size, and 3DM impartiality.
Conclusion: 3DM are an acceptable tool to decrease decisional conflict in breast cancer patients.
Limitations: Small sample size.
Ethics committee approval: Institutional review board approval. Informed consent obtained.
Funding: John S. Dunn Sr. Distinguished Chair Grant and Robert D. Moreton Distinguished Chair Grant.
05:39 A. Lauritzen, København / DK
Purpose:
To validate a fully automatic density estimation tool on a screening cohort in terms of agreement with radiologists’ BI-RADS and cancer risk segregation.
Methods and materials: This study was based on the Danish Capital Region breast cancer screening program from November 1st, 2012, to December 31st, 2013. 4-view FFDMs were available for 53,956 women. The cohort’s median age (IQR) was 59 (54-65) and the cohort included 568 cancers. Radiologists’ BI-RADS 4th edition scores were available from two readers. Using a deep learning-based fully automatic tool developed by the University of Copenhagen, all FFDMs were scored for planimetric percent mammographic density (PMD). The correspondence between two-reader consensus BI-RADS and PMD was evaluated in terms of Spearman correlation and, after categorisation of PMD, with weighted kappa statistics (WKS). The latter was compared to the readers’ interobserver WKS. In terms of cancer risk segregation, the area under the ROC curve (AUC) was compared for PMD and consensus BI-RADS, both with age as a covariate in a logistic regression model.
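The agreement analysis can be illustrated with a small weighted kappa sketch. Several choices here are assumptions, because the abstract does not state them: linear weights are used, and the PMD cut points (25, 50, 75) follow the quartile-style bands of BI-RADS 4th edition rather than any thresholds reported by the authors. The function names and toy values are hypothetical.

```python
from bisect import bisect_right


def categorise_pmd(pmd, cut_points=(25, 50, 75)):
    """Map percent density to a category 1-4 (cut points are an assumption)."""
    return bisect_right(cut_points, pmd) + 1


def weighted_kappa(ratings_a, ratings_b, categories=(1, 2, 3, 4)):
    """Cohen's kappa with linear disagreement weights for ordinal ratings."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagree_obs = disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at maximum distance
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j]
    return 1 - disagree_obs / disagree_exp


# Toy data: automated PMD values and consensus BI-RADS scores; not study data.
pmd_categories = [categorise_pmd(p) for p in (10, 30, 55, 80, 40, 20)]
consensus_birads = [1, 2, 3, 4, 3, 1]
print(pmd_categories, round(weighted_kappa(pmd_categories, consensus_birads), 3))
```

The same function applied to the two readers' individual scores would give the interobserver WKS used as the comparison.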
Results: The correlation between PMD and consensus BI-RADS was 0.85. The WKS between PMD and consensus BI-RADS was 0.693, and the radiologist interobserver WKS was 0.692. The AUC for cancer risk was 0.60 (0.57-0.62) for PMD+age and 0.59 (0.56-0.61) for consensus BI-RADS+age.
Conclusion: The agreement between categorised PMD and consensus BI-RADS matched the radiologists’ interobserver agreement; both were substantial. For cancer risk segregation, there was no significant difference between consensus BI-RADS and PMD. Regarding personalised screening using mammographic density as a risk factor, the results suggest that automated PMD would work as well as consensus BI-RADS of two radiologists.
Limitations: The two readers work at the same clinic.
Ethics committee approval: n/a
Funding: Partially funded by Eurostars project IBSCREEN ref. 9715.
05:13 Ma Jie, Shenzhen / CN
Purpose:
Calcification detection from mammograms plays a vital role in early breast cancer diagnosis. We propose an automatic calcification segmentation solution based on a deep learning framework and a series of novel pre-processing methods.
Methods and materials: We developed a series of novel pre-processing methods to normalise mammogram images and encourage the consistency of labels, and then applied a modified U-Net framework to segment calcifications. Firstly, mammogram images were pre-processed, including window adjustment, breast region extraction, and artefact removal. Small patches of 512×512 pixels were extracted from the original mammograms. Then, based on size and shape, the calcification labels were classified into three types: dots, vessels, and clusters. To encourage the consistency of labels, the patches containing vessels were excluded from the training data. Then, a U-Net model with group normalisation was trained using the processed data. Finally, the best two U-Net models were ensembled to segment calcifications. The training and evaluation were performed on an in-house dataset consisting of 1,776 mammograms with calcifications. Calcifications in this dataset were annotated by two experienced radiologists. The data was randomly split into training (60%), validation (20%), and test (20%) sets.
Results: In our evaluation, a predicted calcification was counted as detected if more than 25% of its area was covered by the label. Recall at 1 false positive per image for the three calcification types (dots, vessels, and clusters) was 0.691, 0.971, and 0.912, respectively. In comparison, recall at 5 false positives per image for these three calcification types reached 0.918, 1.00, and 0.988, respectively.
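The detection criterion can be sketched as a simple overlap test. This is one reading of the 25% rule, taken as the fraction of the predicted region that lies on the annotated label; the function name and the pixel-set representation are hypothetical and not the authors' evaluation code.

```python
def is_detected(predicted_pixels, label_pixels, min_overlap=0.25):
    """True if more than min_overlap of the predicted region lies on the label.

    Both arguments are collections of (row, column) pixel coordinates.
    """
    predicted = set(predicted_pixels)
    if not predicted:
        return False
    covered = len(predicted & set(label_pixels))
    return covered / len(predicted) > min_overlap


# A 2x2 predicted region compared against two hypothetical labels.
prediction = [(0, 0), (0, 1), (1, 0), (1, 1)]
half_covered = is_detected(prediction, [(1, 0), (1, 1), (2, 0)])
quarter_covered = is_detected(prediction, [(1, 1), (1, 2)])
print(half_covered, quarter_covered)
```

Because the comparison is strict, a region covered by exactly 25% does not count as detected in this sketch.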
Conclusion: The developed pre-processing methods and the modified U-Net framework can effectively segment various types of calcifications.
Limitations: The size of the database is limited.
Ethics committee approval: n/a
Funding: No funding was received for this work.
06:55 L. Tanenbaum, New York City / US
Purpose:
To assess performance and potential for workflow reduction using a deep learning triage algorithm.
Methods and materials: An FDA-cleared deep learning (DL) algorithm (cmTriage™, CureMetrix, Inc.) was used to analyse two data sets of 2D mammograms. The first set comprised 400 biopsy-confirmed malignant cases and 855 negative cases from 4 institutions (with at least one-year follow-up as validation of benignity). The second set of 597 sequential screening mammograms was obtained at a single academic institution between July 1 and July 31, 2013.
Results: The overall AUC of the triage software was 0.95 (95% CI = [0.94, 0.96]). In the first simulation, at the default setting, cmTriage performed with a specificity of 77% and a sensitivity of 93% (95% CI = [0.90, 0.96]). At a very high sensitivity of 99% (95% CI = [0.97, 1.00]), the specificity was 40% and 40% of the cases could have been removed from the worklist.
In the second study, there was a worklist reduction of 30% (at high sensitivity) to 63% (at default) of the screening mammograms.
There was a 60% recall reduction at default and a 32% reduction at high sensitivity (all false positives). There was an 11% reduction of benign biopsies at default. The recall rate would be reduced by 44% and 30% respectively with no loss in sensitivity at either operating point. There were no missed cancers at the high sensitivity setting.
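The trade-off between operating point and worklist reduction can be sketched as follows. This is a minimal illustration that assumes cases scoring below the chosen threshold are removed from the worklist; the function name and toy scores are hypothetical and do not describe how cmTriage sets its operating points.

```python
def operating_point(scores, labels, target_sensitivity):
    """Pick the highest threshold whose sensitivity reaches the target.

    scores: suspicion score per case; labels: 1 for cancer, 0 for negative.
    Cases scoring below the threshold are treated as triaged off the worklist.
    """
    cancers = sorted(s for s, y in zip(scores, labels) if y == 1)
    # Number of cancers allowed to fall below the threshold (epsilon guards
    # against floating-point rounding), keeping at least one cancer flagged.
    allowed_misses = int(len(cancers) * (1 - target_sensitivity) + 1e-9)
    allowed_misses = min(allowed_misses, len(cancers) - 1)
    threshold = cancers[allowed_misses]
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y == 1 for f, y in zip(flagged, labels))
    tn = sum((not f) and y == 0 for f, y in zip(flagged, labels))
    return {
        "threshold": threshold,
        "sensitivity": tp / len(cancers),
        "specificity": tn / labels.count(0),
        "worklist_reduction": flagged.count(False) / len(scores),
    }


# Toy scores for 4 cancers and 4 negatives; not study data.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
high_sensitivity = operating_point(scores, labels, target_sensitivity=1.0)
lower_sensitivity = operating_point(scores, labels, target_sensitivity=0.75)
print(high_sensitivity)
print(lower_sensitivity)
```

Raising the target sensitivity lowers the threshold, so fewer cases leave the worklist, which mirrors the smaller reduction reported at the high-sensitivity setting.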
Conclusion: Pre-analysis of mammograms using this triage software has the potential to significantly improve radiologist performance and enhance workflow and productivity without impairing sensitivity.
Limitations: This was a retrospective study. The sample size of the two data sets is limited and expansion to larger data sets is needed.
Ethics committee approval: n/a
Funding: No funding was received for this work.
03:32 Ma Jie, Shenzhen / CN
Purpose:
Interpreting mammograms is a task that requires expertise. We propose an improved encoder-decoder structured CNN model, deeplab v3+, for automatic mass segmentation in mammography.
Methods and materials: We collected ~2,250 mammograms with mass lesions from three public datasets (CBIS-DDSM, INbreast, and Breast Cancer Digital repository (BCD)) and 910 retrospective mammograms with mass lesions from an in-house dataset (Jan 2017 to May 2019). An encoder-decoder structured CNN, deeplab, integrated with atrous spatial pyramid pooling and atrous depth-wise convolution, was used to segment masses. With Xception as the backbone, a novel pixel-wise focal loss was developed to further improve the detected mass contours. The CBIS-DDSM data and our in-house dataset were each randomly split into training (60%), validation (20%), and test (20%) sets. The other two datasets (INbreast and BCD), along with the training subset of CBIS-DDSM, constituted the training set, while the validation and test subsets of CBIS-DDSM were the sole validation and test sets. The best model obtained using the public datasets was further fine-tuned on the in-house training data and then applied to the in-house validation and test data.
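For orientation, a pixel-wise focal loss in its standard binary form can be sketched as below. This is the generic focal loss with the commonly used defaults (gamma = 2, alpha = 0.25), not the novel variant developed in this work, whose details the abstract does not give; the function name and toy values are hypothetical.

```python
import math


def pixelwise_focal_loss(probabilities, targets, gamma=2.0, alpha=0.25):
    """Mean binary focal loss over the pixels of one mask.

    probabilities: predicted foreground probability per pixel (0-1).
    targets:       1 for mass pixels, 0 for background.
    """
    eps = 1e-7
    total = 0.0
    for p, t in zip(probabilities, targets):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        p_t = p if t == 1 else 1 - p
        alpha_t = alpha if t == 1 else 1 - alpha
        # (1 - p_t) ** gamma down-weights pixels that are already well classified.
        total += -alpha_t * (1 - p_t) ** gamma * math.log(p_t)
    return total / len(targets)


# A confidently correct mass pixel contributes far less than an uncertain one.
easy = pixelwise_focal_loss([0.9], [1])
hard = pixelwise_focal_loss([0.6], [1])
print(easy < hard)
```

With gamma set to 0, the expression reduces to a class-weighted cross-entropy, which shows that the focusing term is the only difference.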
Results: The FROC curve of the deeplab model with Xception as the backbone and with the images downsampled by 3 was the highest. The baseline deeplab model reports 0.78 (average precision, AP) and 0.883 (recall). Our best model reports 0.805 (AP) and 0.977 (recall) on the public test data, and 0.901 (AP) and 0.94 (recall) on the in-house dataset.
Conclusion: This study demonstrates a CNN-based segmentation model, deeplab v3+, embedded with atrous spatial pyramid pooling and atrous depth-wise convolution and using an Xception backbone, for effective mass segmentation in mammography.
Limitations: The bias caused by race difference is not well considered.
Ethics committee approval: n/a
Funding: No funding was received for this work.