Research Presentation Session: Imaging Informatics and Artificial Intelligence

RPS 1705 - Human and machine factors in artificial intelligence

March 1, 08:00 - 09:00 CET

7 min
Visual acuity among participants of the European Congress of Radiology 2024: should visual assessment be recommended for radiologists?
Thiemo Van Nijnatten, Maastricht / Netherlands
Author Block: T. Van Nijnatten1, M. Smidt1, J. E. Wildberger1, M. Fuchsjäger2, F. J. Gilbert3, F. Pediconi4, R. G. H. Beets-Tan5, F. Van Den Biggelaar1, C. Catalano4; 1Maastricht/NL, 2Graz/AT, 3Cambridge/UK, 4Rome/IT, 5Amsterdam/NL
Purpose: Currently there is no recommendation regarding visual assessment for radiologists. The aim was to evaluate visual acuity among participants of the European Congress of Radiology (ECR) 2024.
Methods or Background: Participants of ECR 2024, organised by the European Society of Radiology (February 28th-March 3rd 2024; Vienna, Austria), were asked to take part in an on-site visual assessment. Medical ethical approval was obtained and each participant provided written informed consent. The assessment consisted of vision chart reading (Sloan ETDRS Vision Chart) at 66 cm. Afterwards, auto-refraction was performed to determine refractive error. Finally, participants re-read a different vision chart using on-site glasses to correct for the refractive error. A logMAR score of 0.0 (i.e., Snellen equivalent 1.0/100%) was considered adequate visual acuity for radiology reporting.
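For context, the logMAR and decimal Snellen notations are related by decimal acuity = 10^(-logMAR), so the logMAR 0.0 threshold used here corresponds to a decimal acuity of 1.0 (100%, Snellen 6/6 or 20/20), and lower logMAR values indicate better acuity.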
Results or Findings: 321 participants completed the on-site visual assessment (41% (132/321) male and 59% (189/321) female). Reported professions were 114 consultant or board-certified radiologists (36%), 121 radiology residents (38%), 24 PhD students in radiology (7%), 37 medical students (12%), 11 radiographers (3%), 3 medical physicists (1%) and 11 others (3%). Mean age was 30 years (range: 18-69). Of the 57% (182/321) of participants who wore glasses/contact lenses, 171 (94%) wore them during image interpretation tasks.
Among all participants, 24 (7.5%) did not achieve a logMAR score of 0.0 when reading the first vision chart. After auto-refraction measurements, 11 of these 24 participants improved to a logMAR score of 0.0 or better using on-site glasses to correct for the refractive error.
Conclusion: The large majority of participants had adequate visual acuity at a radiology reporting distance of 66 cm. Yet, 7.5% of participants in the on-site visual assessment did not achieve an adequate vision score. Visual assessment could be considered for radiologists.
Limitations: N/a
Funding for this study: N/a
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: METC 2023-0249.
7 min
Exploring how AI influences human gaze behaviour during mammography reading
Yan Chen, Nottingham / United Kingdom
Author Block: A. Taib1, G. Partridge1, P. Phillips2, J. James1, Y. Chen1; 1Nottingham/UK, 2Lancaster/UK
Purpose: Most studies assess artificial intelligence’s (AI) diagnostic performance in mammography, but few examine its impact on human reader behaviour and decision-making. The aim was to investigate the influence of AI prompts on human performance, visual search patterns and reader confidence, and to look for any interaction with readers’ individual personality traits when reading standard 2D screening mammograms.
Methods or Background: In this paired reader study, eight readers working in the UK breast screening programme evaluated a set of 60 anonymised mammograms with and without AI (Lunit Insight MMG). Cases with false negative and false positive AI prompts were incorporated into the test set containing a mix of normal, benign and malignant cases.

Readers initially assessed the mammograms without AI while their visual search behaviour was monitored using eye-tracking equipment (SmartEyePro). After a six-week washout period, readers reviewed the same cases with AI assistance, again with eye tracking. For each read, clinical opinion was recorded using a scale (1-normal, 2-benign, 3-indeterminate, 4-suspicious, 5-malignant) and entered onto the Personal Performance in Mammographic Screening (PERFORMS) website.

Each reader completed a specially designed psychological questionnaire.
Results or Findings: A paired analysis at the breast level, using pathological data as the ‘ground truth’, determined how correct and incorrect AI prompts influenced diagnostic accuracy, gaze behaviour and reader confidence. Reader personality traits were also correlated with these outcomes.
Conclusion: There is little evidence exploring how AI influences a reader’s visual search patterns during mammography interpretation. This pilot study provides an insight into changes in reader behaviour when using AI and will help guide further studies and recommendations on how radiologists should interact with AI when interpreting screening mammography.
Limitations: The limited sample of human readers may lead to a type II error.
Funding for this study: By Lunit.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the institutional review board.
7 min
Narrow AI as a Double-Edged Sword: effects of using AI for fracture detection on distributing attention among focal and peripheral tasks
Ferdinand Mol, Alphen Aan Den Rijn / Netherlands
Author Block: F. Mol1, D. Pourhassan Gilkalaye1, M. H. Rezazade Mehrizi1, W. Grootjans2; 1Amsterdam/NL, 2Leiden/NL
Purpose: To explore the impact of using AI on the distribution of attention between focal tasks (diagnosis) and peripheral tasks (detection of incidental findings) when radiographers evaluate shoulder radiographs.
Methods or Background: 17 radiographers evaluated 255 shoulder radiographs from 15 outpatient trauma patients, with fracture detection as the primary task. To assess the impact of AI assistance, 170 cases were analysed using commercially available AI software for fracture detection (Gleamer). Both fracture (204) and non-fracture (51) cases were included. Additionally, 102 cases had incidental findings (e.g., pulmonary nodules, bone cysts, rotator cuff calcifications). Eye-tracking (Tobii 5) and mouse-tracking (in-house) software were used to measure attention distribution. Data are presented as mean ± standard deviation, and statistical differences were assessed using the Wilcoxon signed-rank test, with significance defined as p<0.05.
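As an illustration of the paired analysis described above, the following is a minimal Python sketch, assuming per-radiographer mean viewing times with and without AI are available as paired arrays; the variable names and numbers are hypothetical and not the study's data.

import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-radiographer mean viewing times in seconds (not study data)
time_without_ai = np.array([30.1, 35.4, 28.9, 41.2, 33.0, 37.8, 29.5, 36.6])
time_with_ai = np.array([45.2, 50.1, 39.7, 55.3, 44.8, 52.0, 41.9, 48.3])

# Summarise as mean ± standard deviation, as reported in the abstract
print(f"without AI: {time_without_ai.mean():.1f} ± {time_without_ai.std(ddof=1):.2f} s")
print(f"with AI: {time_with_ai.mean():.1f} ± {time_with_ai.std(ddof=1):.2f} s")

# Paired non-parametric comparison (Wilcoxon signed-rank test), significance at p < 0.05
statistic, p_value = wilcoxon(time_without_ai, time_with_ai)
print(f"Wilcoxon statistic = {statistic:.1f}, p = {p_value:.4f}")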
Results or Findings: Participants spent more time on cases with AI assistance (n=170), averaging 47 ± 28.24 seconds, compared with 33 ± 28.24 seconds without AI (p<0.001). Mouse clicks averaged 3.85 ± 10.24 without AI and 12.09 ± 10.24 with AI (p<0.001). Eye-tracking data indicated greater attention to peripheral tasks with AI assistance (p=0.024), while fracture detection was higher without AI (p=0.038). Radiographers reported 31% of incidental findings.
Conclusion: The use of narrow AI tools can increase sensitivity towards peripheral tasks, possibly resulting from the enhanced cognitive availability gained by delegating the focal task to AI. Similarly, the measured decrease in fracture detection with AI indicates a decrease in attention towards the focal task. This positions narrow AI as a “double-edged sword”: while automation can free up cognitive resources, it can also lead to over-reliance and reduced attention to focal tasks.
Limitations: The experimental setting may not fully simulate a real-world clinical environment. Eye and mouse tracking may not capture all aspects of attention distribution.
Funding for this study: N.a.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: N.a.
7 min
Colour map recommendations for MR relaxometry
Barbara Daria Wichtmann, Bonn / Germany
Author Block: B. D. Wichtmann1, M. Fuderer2, N. Desouza3, F. Crameri4, V. Gulani5, N. Sollmann6, S. Weingärtner7, S. Mandija2, X. Golay3; 1Bonn/DE, 2Utrecht/NL, 3London/UK, 4Bern/CH, 5Ann Arbor, MI/US, 6Ulm/DE, 7Delft/NL
Purpose: Quantitative imaging data may be colour coded and represented as a colour map. However, commonly used schemes (e.g. rainbow, jet) lack perceptual uniformity, place the brightest colour mid-range and are not usable by colour-blind individuals. Furthermore, the lack of standardisation of colour maps makes comparisons across studies and institutions difficult and misleading. This work describes recently published recommendations for the standardisation of MR relaxometry colour maps (Fuderer, MRM 2024) in order to promote their adoption and drive the process for other biomarkers.
Methods or Background: Recommendations were generated over 4 Delphi rounds. A multidisciplinary committee devised questions on key colour-map features, including the colour choice for T1/T2 maps, even colour gradient contrast, high overall colour and lightness contrast, intuitive and constant gradient magnitude, and recognisability. Questions were circulated to the ISMRM quantitative imaging group and to representatives of European subspecialist societies. Respondents received feedback after each round to aid consensus. Responses on a 9-point Likert scale were summarised into Agree, Neutral and Disagree categories; 75% agreement was the threshold for an item to reach recommendation.
The proposed colour maps were based on previous proposals (Griswold, ISMRM 2018) but modified for perceptual linearity and readability by colour-blind people.
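To make the consensus criterion concrete, the following is a minimal Python sketch of collapsing 9-point Likert responses into Agree/Neutral/Disagree and applying the 75% threshold; the 1-3/4-6/7-9 binning and the example responses are illustrative assumptions, as the abstract does not state the exact cut-points.

from collections import Counter

def consensus(responses, threshold=0.75):
    # Collapse 9-point Likert scores into three categories (assumed binning: 1-3/4-6/7-9)
    def category(score):
        if score <= 3:
            return "Disagree"
        if score <= 6:
            return "Neutral"
        return "Agree"

    counts = Counter(category(r) for r in responses)
    top_category, top_count = counts.most_common(1)[0]
    fraction = top_count / len(responses)
    return top_category, fraction, fraction >= threshold

# Hypothetical responses from 48 panellists for one questionnaire item
item_responses = [8, 9, 7, 8, 6, 9, 7, 8, 8, 9, 7, 7, 8, 9, 8, 7,
                  8, 9, 7, 8, 8, 9, 7, 8, 5, 8, 9, 7, 8, 8, 9, 7,
                  8, 3, 8, 9, 7, 8, 8, 9, 7, 8, 6, 8, 9, 7, 8, 8]
print(consensus(item_responses))  # ('Agree', 0.9166666666666666, True) -> recommendation reached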
Results or Findings: 58 experts responded to Round 1; 48 (45% medical, 47% physicists) completed all 4 rounds. There was consensus that the logarithm-processed Lipari colour map for T1 and the logarithm-processed Navia colour map for T2 were suitable. Colour bars were deemed mandatory, as was a specific value indicating “invalidity”. There was no consensus on whether to fix ranges by anatomy.
Conclusion: The logarithm-processed Lipari colour map for displaying T1 and R1 values and the logarithm-processed Navia colour map for displaying T2, R2, T2* and R2* values are recommended for use in scientific reports.
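As a practical illustration of the recommendation, the following is a minimal Python sketch of displaying a T1 map with logarithmic scaling and the Lipari colour map, assuming matplotlib and the cmcrameri package (which bundles Crameri's Scientific colour maps; lipari and navia are included in recent releases); the data and value range are invented for illustration.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from cmcrameri import cm  # Scientific colour maps; lipari/navia availability depends on the package version

# Hypothetical T1 map in milliseconds (random values standing in for real relaxometry data)
rng = np.random.default_rng(0)
t1_map = rng.uniform(300, 2500, size=(128, 128))

# Logarithmic scaling approximates the "logarithm-processed" mapping; the colour bar is mandatory
fig, ax = plt.subplots()
im = ax.imshow(t1_map, cmap=cm.lipari, norm=LogNorm(vmin=300, vmax=2500))
fig.colorbar(im, ax=ax, label="T1 (ms)")
ax.set_title("T1 map with the Lipari colour map")
plt.show()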
Limitations: Future work will focus on range recommendations.
Funding for this study: No funding was provided for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: No patient-sensitive data were processed in this study.
7 min
Evaluating the Impact of Explainable AI on Anchoring and Automation Biases in Mammography Interpretation
Filippo Pesapane, Milan / Italy
Author Block: F. Pesapane, L. Nicosia, S. Carriero, L. Mariano, A. C. Bozzini, A. Latronico, L. Meneghetti, F. Abbate, E. Cassano; Milan/IT
Purpose: This study investigates how AI support influences diagnostic biases (anchoring and automation) among radiologists with varying experience levels in breast imaging. It evaluates whether explainable AI (XAI), using a heatmap, reduces these biases and improves diagnostic accuracy.
Methods or Background: Six radiologists (2 low experience: 0-5 years, 2 medium: 5-10 years, 2 high: >10 years) participated. Each assessed 200 mammograms across two phases: (1) AI BI-RADS score presented before diagnosis (anchoring phase), and (2) AI score presented after an independent diagnosis (automation phase). A crossover design was used, with a 30% AI error rate. Two AI conditions were tested: standard (score only) and explainable (score with heatmap). Diagnostic changes, accuracy, and bias frequency were recorded, with subgroup analyses based on experience.
Results or Findings: In the anchoring phase, radiologists altered their diagnoses in 180/400 cases (45%) when AI was incorrect; XAI reduced this to 100/400 (25%). In the automation phase, 220/400 correct diagnoses (55%) changed after AI input; XAI reduced this to 120/400 (30%). Low-experience radiologists showed higher susceptibility, particularly in automation (260/400, 65% change rate). XAI improved accuracy in this group by 80/400 cases (20%). Experienced radiologists demonstrated minimal bias reduction with XAI, indicating experience as a moderating factor.
Conclusion: XAI reduces anchoring and automation biases, especially for less experienced radiologists when AI errors are present. Tailored AI solutions with explainability are crucial for unbiased decision-making in breast imaging.
Limitations: This study involved a small sample size, which may limit generalizability. Future studies should expand the sample and explore the long-term impact of XAI on diagnostic confidence.
Funding for this study: N/A
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Code of approval: UID 4810
7 min
The diagnostic performance of an AI model in prostate cancer detection decreased significantly on biparametric MRI scans of reduced quality, while radiologists’ performance did not
Eduardo H. P. Pooch, Amsterdam / Netherlands
Author Block: E. H. P. Pooch1, G. Agrotis1, A. Dehghanpour2, R. G. H. Beets-Tan1, T. Janssen1, I. G. Schoots1; 1Amsterdam/NL, 2Rome/IT
Purpose: To assess the diagnostic performance of an artificial intelligence (AI) model and of radiologists in detecting Grade Group (GG) ≥2 disease in men with suspected prostate cancer on diagnostic biparametric MRI (bpMRI) scans, considering variations in scan quality as assessed by PI-QUAL scores.
Methods or Background: An nnU-Net GG≥2 cancer segmentation model was trained on 1500 bpMRI scans (PI-CAI cohort) and externally validated on 89 scans (PROMIS cohort). The external cohort analysis included PI-RADS v2.1 assessment by two readers (R), one of whom also assigned PI-QUAL v1 (MRI quality) scores. The outcome measure was GG≥2 cancer, based on biopsies. MRI-positive scans were defined as those with PI-RADS scores of 3-5. The diagnostic performance (AUC) of the model and of the radiologists was compared.
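To illustrate the kind of AUC comparison described, the following is a minimal Python sketch computing an AUC with a percentile bootstrap confidence interval for the model and one reader; the data, score construction and bootstrap approach are illustrative assumptions, not the study's actual analysis.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical external-validation data: 89 patients with a binary GG>=2 outcome,
# model probabilities and reader PI-RADS scores (1-5) used as diagnostic scores
y_true = rng.integers(0, 2, size=89)
model_scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=89), 0, 1)
reader_scores = np.clip(np.round(y_true * 1.5 + rng.normal(2.5, 1.0, size=89)), 1, 5)

def auc_with_ci(y, scores, n_boot=2000, alpha=0.05):
    # Point AUC with a percentile bootstrap confidence interval
    aucs = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(y[idx])) < 2:  # a resample must contain both classes
            continue
        aucs.append(roc_auc_score(y[idx], scores[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y, scores), lower, upper

for name, scores in [("AI model", model_scores), ("Reader", reader_scores)]:
    auc, lower, upper = auc_with_ci(y_true, scores)
    print(f"{name}: AUC {auc:.3f} ({lower:.3f}-{upper:.3f})")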
Results or Findings: Overall, the trained model (AUC=0.888) achieved an AUC of 0.652 (0.525-0.760) on external validation. For reduced-quality scans (PI-QUAL 1-3), the model’s AUC dropped to 0.552 (0.350-0.747), while for high-quality scans (PI-QUAL 4-5) it improved to 0.720 (0.556-0.855). In contrast, the AUCs of R1 and R2 were 0.733 (0.631-0.829) and 0.711 (0.614-0.803), respectively, a significant difference from the AI model in the reduced-quality group (p<0.04) but not in the high-quality group (p>0.99). The readers’ AUCs did not drop for reduced-quality scans (0.723 (0.576-0.875) and 0.727 (0.576-0.862)) and did not improve for high-quality scans (0.743 (0.616-0.848) and 0.695 (0.562-0.812)), respectively.
Conclusion: The diagnostic performance of the AI model differed significantly between reduced- and high-quality scans. In contrast, radiologists maintained consistent diagnostic accuracy. To ensure optimal performance, consistently high-quality MRI scans are required for a successful implementation of AI in clinical practice.
Limitations: Limited sample size and only one radiologist provided PI-QUAL scores, which may limit the findings’ generalizability.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: The study was made using public data.
7 min
Variability of classification labels is an important barrier to effective comparison of artificial intelligence software between vendors
Ahmed Maiter, Sheffield / United Kingdom
Author Block: A. Maiter, E. Hesketh, P. Metherall, J. Taylor, S. Alabed, K. Dwivedi, W. Tindale, A. Swift, C. S. Johns; Sheffield/UK
Purpose: Comparing the performance of AI software between different vendors is important for guiding procurement and deployment decisions. This requires consistency in how software outputs are presented. We assessed the number, nature and terminology of classification labels provided by commercially available software from seven vendors for the evaluation of chest radiographs.
Methods or Background: The classification labels provided by the software from each vendor were appraised qualitatively and with descriptive statistics. Synonymous labels were reconciled by merging. Labels were categorised according to their intended purpose. Where relevant, label terminology was compared with the 2024 and 2008 editions of the Fleischner Society Glossary of Terms for Thoracic Imaging.
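As an illustration of the reconciliation step, the following is a minimal Python sketch that merges synonymous labels onto a preferred term and summarises label counts per vendor; the vendor lists and synonym mapping are invented for illustration and are not the study's data.

import numpy as np

# Hypothetical raw classification labels per vendor (invented examples)
vendor_labels = {
    "Vendor A": ["consolidation", "air space opacification", "pneumothorax", "catheter"],
    "Vendor B": ["alveolar pattern opacity", "pneumothorax", "suboptimal nasogastric tube"],
    "Vendor C": ["consolidation", "sarcoidosis", "bronchovascular markings"],
}

# Hypothetical mapping of synonymous terms onto a single preferred term
synonyms = {
    "air space opacification": "consolidation",
    "alveolar pattern opacity": "consolidation",
}

def reconcile(labels):
    # Merge synonymous labels into one preferred term and drop duplicates
    return sorted({synonyms.get(label, label) for label in labels})

merged = {vendor: reconcile(labels) for vendor, labels in vendor_labels.items()}
counts = [len(labels) for labels in merged.values()]

print(merged)
print(f"median labels per vendor: {np.median(counts):.0f} "
      f"(IQR {np.percentile(counts, 25):.0f} to {np.percentile(counts, 75):.0f})")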
Results or Findings: The median number of labels per vendor was 17 (IQR 7 to 100). Most labels were for the detection of pathology (median 88%, IQR 74% to 94%); these varied from non-specific signs (e.g. ‘bronchovascular markings’) to specific diagnoses (e.g. ‘sarcoidosis’). In some cases, individual vendors provided multiple labels with overlapping meanings (e.g. ‘consolidation’, ‘air bronchogram’, ‘air space opacification’ and ‘alveolar pattern opacity’). Fewer labels were for the detection of devices (median 12%, IQR 0% to 23%); these also ranged from non-specific (e.g. ‘catheter’) to more specific with a decision on adequacy (e.g. ‘suboptimal nasogastric tube’). The median concordance of terminology with the Fleischner Society Glossary was 58% (IQR 50% to 72%).
Conclusion: We identified considerable variability in the labels provided by software, including inconsistent adherence to established terminology. This represents a barrier to effective comparison of performance between vendors and potentially limits the clinical utility of software outputs. Our study highlights the need for better harmonisation of output labels across the AI field.
Limitations: The interpretation of classification labels can be subjective and may differ between assessors.
Funding for this study: This study was funded by the NHS South Yorkshire Integrated Care System.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: This study did not involve patient data, and no ethics committee approval was required.
7 min
Rethinking Radiology Reports: The Perspective of Referring Physicians
Philipp Reschke, Frankfurt / Germany
Author Block: P. Reschke, L. D. Gruenewald, V. Koch, E. Höhne, T. J. Vogl, J. Gotta; Frankfurt/DE
Purpose: High report quality and completeness are essential for efficient patient management. However, the clarity and comprehensiveness of radiology reports are often a point of contention among referring physicians. This study aims to assess referring physicians’ perspectives on the quality and utility of radiology reports in clinical practice.
Methods or Background: A prospective, anonymous online survey was conducted from June 2023 to June 2024, targeting practicing physicians in Germany, including internists, general practitioners and surgeons.
Results or Findings: A total of 149 participants were included: 40% internists, 35.8% general practitioners, 24.2% surgeons. The average satisfaction score for radiology report completeness was 34.4 (±42.3) on a scale from -100 to +100. The primary reasons for incomplete reports were a lack of clinical context (33.3%), missing prior imaging (18.6%), inappropriate imaging techniques (13.8%) and unclear clinical questions (11.6%). Nearly half of the respondents (48.9%) preferred concise reports, while 35.7% opted for medium-length, and only 15.4% favored detailed reports. A majority of participants preferred semi- or fully structured reporting formats (92.5%), with free-text being rarely chosen (7.5%), showing no significant differences across specialties (p=0.08). Most participants (84.1%) found imaging in interdisciplinary case conferences valuable for understanding reports, with 35.9% rating them as “very helpful.”
Conclusion: Referring physicians strongly prefer structured reporting and concise radiology reports. Integrating imaging into interdisciplinary meetings can further improve report comprehension.
Limitations: As the survey was conducted exclusively in Germany, the results may not be directly applicable to other healthcare systems or international settings, where clinical practices and communication standards may differ.
The study focused on a limited number of specialties (internists, general practitioners, surgeons), potentially overlooking the perspectives of other key stakeholders such as neurologists, oncologists, or emergency medicine physicians.
Funding for this study: None.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Not applicable