Research Presentation Session
09:48 Or Shwartzman, Beersheba / IL
Purpose:
We quantitatively evaluated the impact of inter-rater bias in manual segmentations of medical images on the output of artificial neural networks (ANNs).
Consistent differences in manual image annotations influence supervised ANNs' training processes. Thus, automatic segmentations produced by ANNs trained on different sources will be consistently different as well.
Methods and materials: MRIs of multiple sclerosis (MS) patients annotated by two radiologists with different levels of expertise were collected. CT scans of intracranial haemorrhage (ICH) patients were annotated twice: manually and semi-manually.
We trained an ANN (U-Net) on annotations from one source and tested its output segmentations against annotations from another source, using Dice scores as the matching criterion.
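For reference, a minimal sketch of the Dice score computation used as the matching criterion (the abstract does not specify an implementation; function and variable names are hypothetical):

```python
import numpy as np

def dice_score(seg_a: np.ndarray, seg_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary segmentation masks."""
    a = seg_a.astype(bool)
    b = seg_b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / denom
```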
We used classifier ANNs to test, via hit-rate, whether two sets of automatic segmentations produced by two identical U-Nets trained on different source segmentations could be distinguished.
We calculated MS-lesion loads and ICH volumes based on the raters’ and the U-Nets’ segmentations and compared the differences.
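Lesion loads and haemorrhage volumes of this kind can be derived from a binary mask and the voxel spacing; a minimal sketch of one such computation (the abstract does not describe the actual calculation, and the names below are hypothetical):

```python
import numpy as np

def mask_volume_ml(mask: np.ndarray, spacing_mm: tuple[float, float, float]) -> float:
    """Volume of a binary segmentation mask in millilitres, given voxel spacing in mm."""
    voxel_volume_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    return mask.astype(bool).sum() * voxel_volume_mm3 / 1000.0  # 1 mL = 1000 mm^3
```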
Results: Lower Dice scores were obtained when ANNs trained on one rater's annotations were tested against the other rater's annotations (cross-evaluation experiment).
Classification hit-rates were higher for U-Nets trained on different sources than for the source annotations themselves: ICH 0.92 (manual) versus 0.95 (automatic); MS 0.90 (manual) versus 0.93 (automatic).
Differences in volumes calculated from the automatic segmentations of ANNs trained on different sources increased and became more consistent.
Conclusion: Segmentation bias between different raters is amplified during ANN training, as the ANNs generalise the manual segmentation examples to the test data. Therefore, ANNs' segmentations and volume calculations for the same input images can differ significantly depending on the training sources.
Limitations: Only a specific segmentation ANN architecture (U-Net) was tested.
Ethics committee approval: n/a
Funding: No funding was received for this work.
10:01 D. Winkel, Basel / CH
Purpose:
To showcase a fully automated workflow combining deep reinforcement learning (DRL) for whole-body volumetric analysis with cloud-based post-processing and storage of data, applied to 10,508 whole-body organ volumes.
Methods and materials: 431 retrospectively acquired multiphasic CT datasets with 10,508 volumes were included in the analysis (10,344 abdominal organ volumes, 164 lung volumes). AI-based whole-body organ volumes were determined using a multi-scale DRL algorithm for 3D body-marker detection and 3D structure segmentation. The algorithm was trained for whole-body organ volumetry on 5,000 datasets. The data were uploaded to a cloud-based application with integrated DRL software, allowing data to be grouped by disease. The total processing time for all volumes and the mean calculation time per case were recorded. A repeated-measures analysis of variance (ANOVA) was conducted to test for robustness with respect to contrast phase and slice thickness. Final whole-body organ metrics were automatically output in comma-separated values (CSV) format.
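A minimal sketch of such a robustness test using a repeated-measures ANOVA (column names and values are hypothetical, and statsmodels' AnovaRM is one possible implementation, not necessarily the one used here):

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format table: one liver volume per (case, contrast phase, slice thickness).
df = pd.DataFrame({
    "case_id":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "phase":     ["native", "native", "venous", "venous"] * 3,
    "thickness": [1.0, 5.0, 1.0, 5.0] * 3,
    "volume_ml": [1850.2, 1848.9, 1852.4, 1849.7,
                  1902.1, 1899.5, 1905.8, 1900.3,
                  1794.6, 1791.0, 1796.2, 1793.8],
})

# Repeated-measures ANOVA: does the measured volume vary with phase or slice thickness?
result = AnovaRM(df, depvar="volume_ml", subject="case_id",
                 within=["phase", "thickness"]).fit()
print(result)
```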
Results: The algorithm calculated organ volumes for the liver, spleen, and right and left kidney (mean volumes in mL: 1868.6, 350.19, 186.30, and 181.91, respectively), and for the right and left lung (2363.1 and 1950.9 mL). We found no significant effect of contrast phase or slice thickness on the organ volumes. The mean computational time per case was 10 seconds. The total computational time for all volumes was 1 hour and 11 minutes.
Conclusion: We were able to show that DRL in combination with cloud computing enables fast processing of substantial amounts of data, allowing organ-specific databases to be built up.
Limitations: n/a
Ethics committee approval: n/a
Funding: D.J.W. receives research support from the Swiss Society of Radiology and the Research Fund Junior Researchers of the University Hospital Basel (grant no. 3MS1034).
06:14 H. Haque, Hino / JP
Purpose:
The annotation process is tedious and time-consuming, and effectively selecting a minimum number of representative unannotated training datasets out of a large-scale heterogeneous dataset is challenging. Our purpose was to iteratively identify training datasets and retrain the current model to segment thigh muscle with state-of-the-art segmentation performance.
Methods and materials: 3,000 IRB-approved clinical CT images were used for this study. A 3D U-Net base model, previously trained to segment the thigh volume into 11 muscle classes on a small cohort of reference datasets, was used. An iterative active learning framework was developed which identified, out of all datasets, those for which the base model's segmentation prediction had higher uncertainty and which could have a positive effect on segmentation accuracy if used for retraining. Identified datasets were further clustered by their similarity features, and 10 representative datasets were selected for annotation in each iteration. The performance of all retrained models was evaluated on 18 randomly selected test thigh volume datasets. Using the average surface distance (ASD) as the segmentation performance metric, we compared against the base model retrained with the same number of randomly chosen datasets.
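A minimal sketch of such a selection step under stated assumptions: mean prediction entropy as the uncertainty measure and k-means clustering for representativeness (the abstract does not specify the exact uncertainty or similarity measures, and all names below are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_for_annotation(probs: np.ndarray, features: np.ndarray,
                          pool_size: int = 100, n_select: int = 10) -> np.ndarray:
    """Pick representative, high-uncertainty cases from an unannotated pool.

    probs:    (n_cases, n_voxels, n_classes) softmax outputs of the base model
    features: (n_cases, n_features) similarity features per case
    """
    # Mean voxel-wise entropy as a per-case uncertainty score.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean(axis=-1)
    # Keep only the most uncertain candidates.
    candidates = np.argsort(entropy)[-pool_size:]
    # Cluster candidates by similarity features; take the most uncertain case per cluster.
    km = KMeans(n_clusters=n_select, n_init=10).fit(features[candidates])
    chosen = []
    for c in range(n_select):
        members = candidates[km.labels_ == c]
        chosen.append(members[np.argmax(entropy[members])])
    return np.array(chosen)
```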
Results: After the second iteration, the evaluation indicated a 53% improvement in median ASD over all muscle classes, compared with a 22% improvement when retraining with randomly chosen datasets. A paired t-test (p < 0.005) showed a clear improvement in segmentation performance.
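For reference, a minimal sketch of the ASD metric using scipy's distance transform with the voxel spacing (one common formulation, not necessarily the study's implementation):

```python
import numpy as np
from scipy import ndimage

def average_surface_distance(pred: np.ndarray, gt: np.ndarray,
                             spacing: tuple[float, float, float]) -> float:
    """Symmetric average surface distance (mm) between two binary masks."""
    def surface(mask):
        mask = mask.astype(bool)
        return mask & ~ndimage.binary_erosion(mask)

    s_pred, s_gt = surface(pred), surface(gt)
    # Distance from each voxel to the nearest surface voxel of the other mask.
    d_to_gt = ndimage.distance_transform_edt(~s_gt, sampling=spacing)
    d_to_pred = ndimage.distance_transform_edt(~s_pred, sampling=spacing)
    distances = np.concatenate([d_to_gt[s_pred], d_to_pred[s_gt]])
    return float(distances.mean())
```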
Conclusion: Retraining a base model with datasets identified by the active learning framework achieved higher segmentation accuracy than training with randomly selected datasets. To keep segmentation performance within the expected range as datasets grow, the framework finds effective training datasets for annotation and reduces unnecessary annotation workload.
Limitations: The current model does not support images with metal artefacts.
Ethics committee approval: IRB, Keio University School of Medicine.
Funding: AMED JP19lk1010025.
06:18 A. Meldo, St. Petersburg / RU
Purpose:
To create a general methodology for generating databases for machine learning algorithms.
Methods and materials: 450 anonymised chest CTs were included in the dataset. The RadiAnt DICOM Viewer was chosen for modifying and handling the dataset. The prepared dataset was archived on the server of the oncological centre before the morphological diagnostics. The data were then anonymised with DicomCleaner™ and renamed with a special code. We used the nodule's shape and internal and external structure as features and represented them as histograms for radiomics. After morphological confirmation, we assigned a class label to each case.
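As an illustration, intensity histograms of this kind can be computed per nodule region of interest; a minimal sketch with hypothetical inputs (the abstract does not detail the feature extraction):

```python
import numpy as np

def nodule_histogram(ct_hu: np.ndarray, nodule_mask: np.ndarray,
                     bins: int = 32, hu_range=(-1000, 400)) -> np.ndarray:
    """Normalised intensity histogram of the voxels inside a nodule mask."""
    values = ct_hu[nodule_mask.astype(bool)]
    hist, _ = np.histogram(values, bins=bins, range=hu_range)
    return hist / max(hist.sum(), 1)  # normalise; guard against empty masks
```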
Results: We created the database LIRA (Lung Image Resource Annotated) for the development and testing of CAD. All cases were confirmed morphologically. In our study, 65% of lung cancer (LC) cases had a typical CT appearance, 26% of CT images corresponded to different diseases requiring additional differential diagnostic criteria, and 9% of LC cases were extremely difficult to recognise on CT due to an atypical imaging appearance. Class labels for the subsets were "typical LC", "atypical", and "not cancer".
Conclusion: One of the conditions for the successful use of medical CAD systems is a correctly created database that corresponds with clinical and radiological interpretation.
We would like to point out that the main proposed idea of the methodology for creating medical databases can be formulated as follows: structuring of the data, their homogenisation, verification of diseases, and the inclusion of "atypical" cases and cases that look similar to the disease under study.
The LIRA dataset contains class labels for each nodule to support the development of differential diagnostic intelligent algorithms.
Limitations: n/a
Ethics committee approval: n/a
Funding: The reported study was funded by the Russian Science Foundation, project number 18-11-00078.
04:53 J. van Lunenburg, Hong Kong / HK
Purpose:
Software choice often dictates which radiomic features can be extracted from images. This study analyses how different feature sets affect clinical models in two cancer cohorts: PET oesophagus (in-house dataset) and CT lung (open dataset).
Methods and materials: 95 patients (65 in the training and 30 in the validation cohort) who had undergone pre-treatment 18F-FDG-PET studies were included and classified as those who achieved complete pathological response (pCR) and those who did not. The primary tumours were segmented using a fixed-threshold approach, radiomic features were extracted, and features were annotated according to the Image Biomarker Standardisation Initiative (IBSI). Feature reduction with the minimum redundancy maximum relevance (mRMR) method was performed. Logistic regression and random forest models for clinical outcomes were constructed and compared using ROC curves, decision curves, and McNemar's test.
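A minimal sketch of this pipeline under stated assumptions: a greedy mRMR-style heuristic (mutual-information relevance minus mean absolute correlation redundancy) rather than the exact mRMR implementation used in the study, with scikit-learn models; all names and the commented usage are hypothetical:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def mrmr_select(X: np.ndarray, y: np.ndarray, k: int = 10) -> list[int]:
    """Greedy mRMR-style selection: maximise relevance, penalise redundancy."""
    relevance = mutual_info_classif(X, y, random_state=0)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        redundancy = corr[:, selected].mean(axis=1)
        score = relevance - redundancy
        score[selected] = -np.inf  # never re-pick a selected feature
        selected.append(int(np.argmax(score)))
    return selected

# Hypothetical usage with train/validation splits (X_train, y_train, X_val, y_val):
# idx = mrmr_select(X_train, y_train, k=10)
# for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
#     model.fit(X_train[:, idx], y_train)
#     auc = roc_auc_score(y_val, model.predict_proba(X_val[:, idx])[:, 1])
```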
Results: Of the 5 feature sets tested, AUC values for the two models ranged between 0.6 and 0.87. Reduced feature sets did not overlap much (10-25%) and showed substantial differences in their ability to predict outcomes with our models and datasets. A combined superset showed the best performance (AUC 0.87) but did not show much reduction in feature correlation.
Conclusion: High-dimensional data require extensive feature reduction and increased scrutiny for overfitting, but if a proper methodology is applied, combining multiple feature sets may be beneficial for radiomics modelling in cancer.
Limitations: The PET dataset is not very large (low incidence and an expensive modality) and the event frequency is slightly below 40%. Fixed-threshold segmentation may be less accurate for lesion size, but it was chosen to maximise radiomics stability.
Ethics committee approval: The study was approved by the local ethics committee in accordance with the Helsinki Declaration, and all patient information was anonymised prior to analysis.
Funding: No funding was received for this work.
05:30 R. Illing, Budapest / HU
Purpose:
The value proposition of artificial intelligence (AI) solutions in healthcare has been well described, and it is apparent that 'narrow AI' will have a role in every stage of the clinical workflow. The deployment of AI solutions in the clinical workflow of a multi-country healthcare organisation has many challenges, but establishing a systematic and unified framework can lead to successful outcomes. By defining measurable performance indicators, it is possible to track in real time whether actions benefit all stakeholders: patients, referrers, and business partners.
Methods and materials: To establish the framework for AI deployment in the clinical workflow, working groups were formed at group and country level between the clinical, legal, data-protection, digital, operational, commercial, and marketing functions to identify all areas of relevance in a project management approach.
Results: A unified framework of AI deployment was identified in eight steps:
- The selection criteria for AI solutions, pilot centres, and countries.
- A legal review including medical device class, intended use, and data protection impact assessment.
- A technical architecture review including digital infrastructure requirements and the integration of an AI solution.
- The definition of workflows for the assessment of AI solution benefits.
- The training of healthcare professionals.
- The definition of key performance indicators.
- The commercialisation process.
- The preparation of stakeholders’ information and communication strategy and material.
Conclusion: The framework designed is robust enough to encompass the introduction of different AI solutions across the network. However, minor methodological adjustments are required for each specific AI solution to meet the requirements of use and implementation.
Limitations: n/a
Ethics committee approval: n/a
Funding: No funding was received for this work.
04:01 J. Munuera, Barcelona / ES
Purpose:
A gold standard is critical for clinical validation studies, but establishing it is usually time-consuming for researchers. This project aimed to explore whether a deep learning (DL)-based artificial intelligence (AI) diagnostic system would benefit researchers in establishing the gold standard by reducing the discrepancies between readers.
Methods and materials: In this study, we utilised 196 chest X-ray (CXR) scans diagnosed as either pneumonia or effusion and employed a DL AI system (InferRead DR Chest Research, Infervision) to study its effect in eliminating discrepancies when generating the gold standard. Two senior radiologists participated in the reader study. They reviewed the CXR images with and without the aid of the AI diagnostic system at an interval of 4 weeks. Discrepancies between their reviews were analysed and evaluated with Cohen's kappa.
Results: There were 18 cases that were diagnosed differently by the two senior radiologists when they reviewed the CXR images alone. With the assistance of the AI diagnostic system, the number of differently diagnosed cases dropped to 7, a 61.1% reduction in discrepancies. Of note, the kappa score increased from 0.806 to 0.928 after utilising the AI diagnostic system. In particular, according to the ground truth of these CXR images, utilisation of the AI diagnostic system eliminated 17 discrepancy cases, among which 8 false-positive and 7 false-negative cases were corrected. Meanwhile, among the 6 newly introduced discrepancy cases, one was a false-positive case that was corrected by one radiologist with the help of the AI system.
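For reference, a minimal sketch of the agreement computation with Cohen's kappa (the reader labels below are hypothetical, not the study's data):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels from the two senior radiologists (1 = finding present).
reader_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
reader_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(reader_1, reader_2)
print(f"Cohen's kappa: {kappa:.3f}")
```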
Conclusion: A DL-based AI diagnostic system greatly reduced the diagnostic discrepancies among radiologists and could be utilised to assist the establishment of a gold standard in clinical validation studies.
Limitations: n/a
Ethics committee approval: n/a
Funding: No funding was received for this work.
05:05 V. Sorin, Ramat Gan / IL
Purpose:
Natural language processing (NLP) enables the conversion of free text into structured data. Recent innovations in deep learning technology provide improved NLP performance. We aimed to survey deep learning NLP fundamentals and review radiology-related research.
Methods and materials: This systematic review followed the PRISMA guidelines. We searched for deep learning NLP radiology studies published up to September 2019. MEDLINE, Scopus, and Google Scholar were used as search databases.
Results: 10 relevant studies published between 2017 and 2019 were identified. The deep learning models applied for NLP in radiology are convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM) networks, and attention networks.
Deep learning NLP applications in radiology include flagging diagnoses such as pulmonary embolism (PE) and fractures, labelling follow-up recommendations, and automatic selection of imaging protocols. Deep learning NLP models perform as well as or better than traditional NLP models.
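As an illustration of the LSTM-based approach, a minimal PyTorch sketch of a report classifier (vocabulary size, dimensions, and labels are hypothetical; none of the reviewed studies' code is reproduced here):

```python
import torch
import torch.nn as nn

class ReportClassifier(nn.Module):
    """Toy LSTM classifier: token IDs of a report -> logits for a finding (e.g. PE)."""
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)  # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])          # (batch, n_classes) logits

model = ReportClassifier()
dummy_reports = torch.randint(1, 10_000, (4, 200))  # batch of 4 tokenised reports
logits = model(dummy_reports)
```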
Conclusion: Research on and use of deep learning NLP in radiology are increasing. An acquaintance with this technology can help prepare radiologists for the coming changes in their field.
Limitations: This systematic review has several limitations. First, heterogeneity of studies and variability in measures between studies prevented a meta-analysis. Second, we limited our search to radiology applications; deep learning NLP is implemented for the analysis of many kinds of medical texts, not limited to our field. Finally, deep learning for NLP is a rapidly expanding topic, so there may be concepts and applications published after our review was performed.
Ethics committee approval: n/a
Funding: No funding was received for this work.