Research Presentation Session: Artificial Intelligence and Imaging Informatics

RPS 705 - Foundation models: building blocks or building hype?

March 5, 08:00 - 09:30 CET

6 min
Monitoring Black-Box AI Tools for Radiology with a Local Foundation Model
Camila Gonzalez, Vienna / Austria
Author Block: C. Gonzalez, Z. Fang, H. S. Na, D. Larson, A. Chaudhari; Palo Alto, CA/US
Purpose: The range of AI tools offered for radiology is expanding rapidly, yet their performance on local data is often unclear. Manually annotating site-specific studies or training local models for each use case is infeasible. We show how a single vision-language model trained on routinely collected scans and radiology reports can estimate the confidence of commercial AI tools.
Methods or Background: We trained a vision–language model on 4,648 in-house non-contrast head CT studies (median patient age 69.4 years, 43.7% female) and associated radiology reports. We extracted zero-shot predictions for intracranial hemorrhage, midline shift, mass effect, and ischemic stroke by formulating textual prompts and calculating the cosine similarity between positive and negative prompts in the latent space. We utilized those predictions and density-based uncertainty quantification to calibrate three black-box AI models, including an FDA-cleared tool.
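The zero-shot step described above can be sketched as follows. This is a minimal illustration under my own assumptions, not the authors' implementation: it presumes the image and prompt embeddings are already available as NumPy vectors, and the function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_score(image_emb, pos_prompt_emb, neg_prompt_emb):
    """Positive score: the image embedding is closer to the positive
    prompt (e.g. 'intracranial hemorrhage present') than to the
    negative prompt; negative score: the reverse."""
    return cosine(image_emb, pos_prompt_emb) - cosine(image_emb, neg_prompt_emb)
```

A binary zero-shot prediction then follows from the sign of the score, and the score magnitude can serve as a crude confidence signal.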
Results or Findings: The vendor model reached a sensitivity of 0.64 and specificity of 0.83 on the in-house test set. Distance distributions between false negative and true negative black-box predictions differed significantly across all splits (p < 0.001; Mann-Whitney U), showing that misclassifications can be identified from image embeddings. Selecting different confidence thresholds on validation data increased sensitivity to 0.75 and 0.81, with only moderate rises in false positives.
Conclusion: A single vision-language model trained on routinely collected data can help evaluate the usability of black-box AI tools both before and after deployment. By providing scalable oversight across multiple products and clinical tasks with minimal overhead, the proposed framework supports safer integration of AI into radiology practice.
Limitations: Our current results are limited to data from a single hospital. We plan to make the proposed framework openly available and extend validation to additional clinics in the future.
Funding for this study: No funding
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Enhancing Ultrasound Image Analysis: Domain Adaptation of Vision-Language Models with Adapters
Jingguo Qu, Hong Kong / China
Author Block: J. Qu, X. Han, T. Xiao, J. Qin, A. D. King, W. C. W. Chu, J. Cai, M. Ying; Hong Kong/HK
Purpose: To develop and evaluate domain adaptation methods for vision-language foundation models (VLMs) to enhance medical ultrasound (US) image analysis. This study addresses the performance limitations of existing VLMs caused by the domain shift from natural to medical images, aiming to improve automated segmentation and classification of lesions in US scans.
Methods or Background: We adapted a pre-trained CLIP model using parameter-efficient fine-tuning. Specifically, we integrated a novel multi-cognitive visual adapter (Mona) into the vision transformer backbone of a frozen CLIP model. For downstream tasks, we designed lightweight segmentation and classification heads incorporating feature map up-sampling and adaptive average pooling to handle variations in lesion size and reduce computational overhead. The framework was evaluated on six public and in-house US datasets (including one external testing set) for lymph nodes, breast, thyroid, and prostate.
Results or Findings: For segmentation, our adapted CLIP model with the Mona adapter, without full fine-tuning of the backbone, outperformed all available models and achieved a Dice score of 0.831 on the internal lymph node dataset. For classification, supervised fine-tuning was necessary, as zero-shot accuracy was near random chance. The fine-tuned model achieved up to 0.738 accuracy and 0.850 AUC on an external lymph node test set. Fine-tuning on small datasets led to performance degradation in segmentation, indicating catastrophic forgetting.
Conclusion: Our study demonstrated that adapting large-scale VLMs pre-trained on natural images is a highly effective strategy for ultrasound image segmentation. Our proposed method demonstrates superior segmentation capabilities. However, robust classification and overcoming the negative effects of fine-tuning on small datasets remain key challenges for future work.
Limitations: Key limitations include poor zero-shot classification, requiring supervised data for diagnosis, and catastrophic forgetting, where fine-tuning on small datasets degrades robust pre-trained features, especially for segmentation.
Funding for this study: This work was supported by General Research Funds of the Research Grant Council of Hong Kong (Reference no. 15102222 and 15102524).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
ThighSeg: Foundation Model Adaptation for Thigh Composition Analysis in Multi-Sequence MRI
Shengqian Huang, Beijing / China
Author Block: S. Huang, G. Hu, Q. Wang, D. Zhang, Z. Jin, H. Xue; Beijing/CN
Purpose: To develop and validate a universal deep learning framework, based on the adaptation of a large-scale medical vision foundation model, for automated thigh composition analysis across diverse MRI sequences.
Methods or Background: Multisequence MRI data (e.g., T1, T2, STIR, and Dixon sequences) from five public datasets and two local datasets were annotated for sartorius, quadriceps, adductor muscles, gracilis, hamstrings, femur, and subcutaneous tissue. The training set included 498 sequences (222 participants) from TotalSegmentatorMRI, UFATS, HuashanMyo, and local cohort 1; internal testing used 109 sequences (55 participants); two external test sets comprised 54 sequences (27 participants, Folkhälsan) and 154 sequences (19 participants, MyoSegmenTUM), respectively. A 2D nnUNet model fine-tuned from MedDINOv3 for 100 epochs was applied to segment thigh components. Segmentation and measurements were evaluated using Dice similarity coefficient (DSC) and intraclass correlation coefficient (ICC). Six-week resistance-training effects were evaluated in 12 participants from local cohort 1 using paired t-tests; age-related changes were assessed in 1017 participants from local cohort 2 using Pearson correlation.
Results or Findings: The model demonstrated robust segmentation performance on internal (DSC: 0.889-0.956) and external test sets (DSC: 0.806-0.938 for Folkhälsan, 0.834-0.900 for MyoSegmenTUM). Automated measurements of muscle volume and muscle fat fraction showed strong agreement with reference values in both healthy volunteers (ICC: 0.931-0.995) and patients with neuromuscular diseases (ICC: 0.905-0.988). The model detected significant resistance-exercise-induced increases in the volume of the sartorius, quadriceps, and hamstrings (P < 0.01). Muscle volumes correlated negatively with age in the local population cohort, especially for the quadriceps femoris (r = -0.546 in females, -0.450 in males).
Conclusion: ThighSeg is an automated and robust tool for thigh composition analysis.
Limitations: The limitations of the study include the retrospective design and the insufficient assessment of generalization ability to unseen sequences.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the Ethics Committee of Peking Union Medical College Hospital.
6 min
Distinguishing ILA from Atelectasis on Radiographs Using a Foundation Model with Multi-Institutional Validation
Tician Schnitzler, Aarau / Switzerland
Author Block: T. Schnitzler1, A. P. Gehret1, H. Zaytoun1, A. Nowroozi2, M. Bondarenko2, J. H. Sohn2; 1Aarau/CH, 2San Francisco, CA/US
Purpose: To evaluate the performance of a radiology vision foundation model (RAD-DINO) combined with a lightweight multi-layer perceptron (MLP) classifier for distinguishing interstitial lung abnormalities (ILA) from atelectasis on chest radiographs, validated externally across institutional datasets.
Methods or Background: A classification pipeline was developed using RAD-DINO as a frozen feature extractor with an appended MLP classifier. Training utilized a curated dataset comprising posteroanterior (PA) chest radiographs of confirmed ILA and atelectasis cases, verified through same-day CT imaging. The internal dataset included 542 ILA and 1,167 atelectasis cases. The external validation cohort consisted of 85 ILA and 100 atelectasis cases. Data were split into training, validation, and internal test sets using an 80:10:10 ratio. Given the limited number of ILA cases, the RAD-DINO backbone remained frozen to mitigate overfitting. Random sampling ensured balanced representation of atelectasis cases during training epochs. The model was trained for 10 epochs using cross-entropy loss and the Adam optimizer. Performance metrics included accuracy, precision, recall, F1 score, and ROC-AUC, assessed on internal and external test datasets.
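The frozen-extractor design above — backbone weights fixed, only a lightweight head trained — can be sketched as a forward pass. This is a hypothetical NumPy stand-in, not the authors' code: RAD-DINO would supply `features`, and the weight names and sizes are illustrative.

```python
import numpy as np

def mlp_head(features, W1, b1, W2, b2):
    """Forward pass of a lightweight MLP classification head on top of
    frozen backbone features (shape: n_samples x feature_dim). Only
    W1, b1, W2, b2 would be updated during training."""
    hidden = np.maximum(features @ W1 + b1, 0.0)   # ReLU hidden layer
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))           # sigmoid probability

# Random stand-ins for frozen features of 4 radiographs (768-d embeddings)
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 768))
W1, b1 = rng.normal(size=(768, 64)) * 0.01, np.zeros(64)
W2, b2 = rng.normal(size=(64, 1)) * 0.01, np.zeros(1)
probs = mlp_head(feats, W1, b1, W2, b2)
```

Freezing the backbone means the expensive feature extraction can even be precomputed once per image, which is what makes training with only a few hundred labeled ILA cases feasible.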
Results or Findings: The final model demonstrated robust performance with an accuracy of 79.8% on internal validation and 81.1% externally. ROC-AUC scores were 0.865 (internal validation) and 0.890 (external test). The frozen RAD-DINO effectively provided generalizable features, while random sampling enhanced training stability and performance consistency across diverse datasets.
Conclusion: Radiological vision foundation models, combined with a lightweight MLP classifier, effectively distinguish ILA from atelectasis on chest radiographs. Freezing the RAD-DINO backbone facilitated robust transfer learning with limited labeled data, maintaining high generalizability and diagnostic accuracy across multiple institutions.
Limitations: No limitations.
Funding for this study: Swiss Society of Radiology
Gottfried & Julia Bangerter-Rhyner Foundation
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Institutional Review Board No. 17-22317.
6 min
How Many Samples to Label for an Application given a Chest X-Ray Foundation Model?
Anton Khardin, Moscow / Russia
Author Block: N. Nechaev, D. Umerenkov, V. Gombolevskiy, E. Przhezdzetskaya, A. Khardin, D. Dylov; Moscow/RU
Purpose: Estimating how many labeled cases are needed to meet a clinical performance target is essential for planning cost-effective model development. We investigate whether power-law fits to early learning curves can predict the training size required to reach an ROC-AUC threshold for chest X-ray (CXR) pathology classifiers built on top of foundation models.
Methods or Background: We constructed pathology-specific binary CXR datasets from MIMIC-CXR using RadGraph-derived labels normalized into 20 distinct classes. For each pathology, we formed train/val/test splits and sampled training subsets, adding negatives at a 1:5 ratio. We evaluated feature-based transfer learning with four encoders: RadDINO-MAIRA-2, XrayCLIP, XraySigLIP, and a ResNet-50 baseline. We then fit a power-law to the observed learning curves and estimated the number of positive cases needed to reach ROC-AUC 0.90—using fits built from limited early points.
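The power-law extrapolation described above can be sketched as follows. This is a minimal sketch under my own assumptions, not the authors' exact fitting procedure: it models the error as 1 − AUC ≈ a·n^(−b) and fits it linearly in log space.

```python
import numpy as np

def fit_power_law(n_samples, auc):
    """Fit 1 - AUC ~ a * n^(-b) via a linear fit in log-log space.
    Returns (a, b) for the power law."""
    err = 1.0 - np.asarray(auc, dtype=float)
    slope, log_a = np.polyfit(np.log(n_samples), np.log(err), 1)
    return np.exp(log_a), -slope   # slope of log(err) vs log(n) is -b

def samples_for_target(a, b, target_auc=0.90):
    """Extrapolate the training-set size needed to reach target_auc."""
    return (a / (1.0 - target_auc)) ** (1.0 / b)
```

Fitting on a handful of early learning-curve points and extrapolating with `samples_for_target` is the kind of "n@90" estimate the abstract reports; the 1:5 positive-to-negative ratio and the choice of error model are details that would need to match the authors' protocol.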
Results or Findings: Foundation models substantially reduced labeled data needs versus ResNet-50. Across several pathologies, XraySigLIP/XrayCLIP achieved strong ROC-AUC with n@90 in the tens to low hundreds, while ResNet-50 often required orders of magnitude more data. For example, n@90 dropped from thousands or millions with ResNet-50 to double-digit counts with XraySigLIP. Crucially, fits using ≤50 positive cases provided reliable extrapolations of the eventual plateau. Early-slope magnitude correlated with final ROC-AUC across model–pathology pairs, supporting its use as a planning signal.
Conclusion: A simple protocol—train on small, incremented subsets; fit a power law; extrapolate to a target ROC-AUC—enables practical sample-size estimation for CXR pathologies with foundation-model features. In many cases, ~50–100 positive cases suffice to predict (and often achieve) clinically competitive performance, guiding annotation budgets and deployment timelines.
Limitations: Results are derived from one public dataset with RadGraph-based labels and frozen encoders; prospective, multi-center validation, end-to-end finetuning, and alternative targets (e.g., F1, sensitivity at fixed specificity) warrant study.
Funding for this study: No external funding was received; the work was conducted as part of the authors’ institutional duties.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Automated monitoring framework for foundation model-generated draft reporting: Example use case in chest radiography
Laura Brink, Reston / United States
Author Block: L. Brink1, A. Burade2, N. Bhatia2, K. Schmidt1, M. K. Kalra2, S. Mercaldo2, L. Coombs1, B. C. Bizzo2; 1Reston, VA/US, 2Boston, MA/US
Purpose: To develop and test a framework for automated performance monitoring of draft reports from foundation models using chest radiographs (CXR) as a use case example.
Methods or Background: Our retrospective, single-site study used 147 adult patients’ CXR reports with radiologist-annotated findings (625 positive findings in total) to evaluate large language models (LLMs) for automated extraction of 233 predefined findings (regardless of criticality, significance, and severity) from each report. The best prompting strategy was then applied to AI draft reports generated with a commercial visual language model (VLM, Harrison.ai) from an additional 121 CXRs. Radiologist reports served as the reference. Performance metrics with 95% confidence intervals (CIs) were calculated per finding and summarized at the case level and by finding criticality.
Results or Findings: Both one-shot and zero-shot prompting with Claude-3.5 achieved high accuracy (99.2%, CI:99.1–99.3) and specificity (99.6%, CI:99.6–99.7) for extracting reported findings, with minor sensitivity differences (one-shot: 77.8%, CI:74.4–80.9; zero-shot: 74.9%, CI:71.4–78.2). Using the selected extraction approach, the VLM achieved high accuracy (98.2%, CI:98.0–98.3) but substantially lower sensitivity (39.2%, CI:34.6–44.0) and F1 score (0.38, CI:0.34–0.43), indicating lower performance in finding-positive reports and high performance in normal reports.
Conclusion: Our framework can help automate monitoring of VLM-derived radiology report drafts. Since the findings extraction methods were highly accurate, the framework exposed the VLM’s limited ability to capture true findings despite the high overall accuracy. We expect to scale the framework for additional LLMs and VLMs with temporal performance monitoring.
Limitations: Small sample size, single-site data, single-country LLMs, and the lack of stratified PA/portable CXR analyses limit the generalizability evaluation of our approach.
Funding for this study: No funding was received for this study.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: IRB details: 2020P001792
6 min
Diagnostic Accuracy of Pretrained Medical Foundation Models on Knee X-Rays
Hanbin Ko, Gwangmyeongsi / Korea, Republic of
Author Block: H. Moon, H. Ko, Y. Kim, D. Lee, H. D. Chae, C. M. Park; Seoul/KR
Purpose: To investigate whether general medical foundation models can effectively adapt to downstream imaging tasks beyond their original training domains. Specifically, we evaluate their diagnostic accuracy in knee radiograph analysis—including Kellgren–Lawrence grading, effusion detection, and fracture identification—and compare them with a knee-specialised expert model trained on domain-specific data to assess the added value of task-focused adaptation.
Methods or Background: We retrospectively analysed 110,734 knee radiographic studies (2003–2023) from a tertiary hospital, reserving the most recent two years (10,248 studies) as a temporally separated test set. Labels for Kellgren–Lawrence grade (0–4), effusion, and fractures (acute and periprosthetic) were extracted from structured radiology reports using a rule-based approach. Three models with identical architectures were evaluated: MedSigLIP, a general medical foundation model pretrained via Google Med-Gemma; a knee-specialised model further trained with our images and reports; and a randomly initialised baseline. Performance was assessed using accuracy, F1-score, and AUC on both frontal views and multi-view inputs (frontal, lateral, skyline).
Results or Findings: The knee-specialised model consistently outperformed both MedSigLIP and the randomly initialised baseline. For effusion (all views), accuracy/F1/AUC were 87.7/87.3/93.4 for the knee-specialised model versus 82.2/82.2/91.8 for MedSigLIP and 80.7/79.8/89.1 for random (p<.001 for knee-specialised vs both). For fracture, values were 77.1/78.1/83.2 (knee-specialised) vs 68.9/67.3/72.8 (MedSigLIP) and 70.9/68.6/76.4 (random) (p<.001). For KL grading, accuracies were 56.5%, 54.1%, and 51.1%, respectively, with the knee-specialised model showing modest but significant gain over MedSigLIP (p<.05). Multi-view inputs improved effusion and fracture detection, while frontal views remained superior for KL grading.
Conclusion: General medical foundation models improved performance over models without pretraining, supporting their role in enhancing downstream imaging tasks. Additional task-specific adaptation further boosted diagnostic accuracy, underscoring the complementary value of both general pretraining and domain-specialised refinement.
Limitations: Findings are restricted to internal validation.
Funding for this study: This study was supported by a grant from the Korea Health Industry Development Institute (KHIDI)
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: IRB approval obtained (Seoul University Hospital, No. 2405-070-1536).
6 min
A Foundation Model Framework for Multi-View MRI Classification of Extramural Vascular Invasion and Mesorectal Fascia Invasion in Rectal Cancer
Yumeng Zhang, Maastricht / Netherlands
Author Block: Y. Zhang1, S. A. Mali1, H. C. Woodruff1, S. Amirrajab1, E. I. Crespo2, A. Jimenez-Pastor2, L. Marti-Bonmati2, Z. Salahuddin1, P. Lambin1; 1Maastricht/NL, 2Valencia/ES
Purpose: Accurate MRI-based identification of extramural vascular invasion (EVI) and mesorectal fascia invasion (MFI) is crucial for risk-stratified rectal cancer treatment. However, subjective visual assessment and inter-institutional variability limit diagnostic consistency. Therefore, this study aims to develop and externally evaluate a multi-center, foundation-model-driven framework that automatically classifies EVI and MFI on axial and sagittal T2-weighted MRI.
Methods or Background: 331 pre-treatment rectal-cancer MRI scans from three European hospitals (La Fe University and Polytechnic Hospital, Unidade Local de Saúde Hospital, and Centre Hospitalier Universitaire d’Angers) were retrospectively analyzed. A self-supervised frequency-domain harmonization pipeline was used to reduce scanner variability. Three classifiers—SeResNet, the universal biomedical pretrained transformer (UMedPT) with a multilayer perceptron (MLP) head, and a logistic-regression variant using frozen UMedPT features (UMedPT_LR)—were trained (n=265) and tested (n=66). Gradient-weighted class activation mapping (Grad-CAM) visualized model predictions.
Results or Findings: UMedPT_LR achieved superior EVI classification using fused axial and sagittal features (area under the receiver operating characteristic curve, AUC = 0.82). Optimal MFI detection occurred with UMedPT using axial harmonized images (AUC = 0.77); these results outperform the challenge winners. Frequency-domain harmonization enhanced MFI performance, with variable effects on EVI. Multi-view fusion, which combined axial and sagittal features, consistently improved EVI classification. Conventional convolutional neural networks (CNNs) underperformed, especially in F1 score and balanced accuracy. Grad-CAM demonstrated appropriate model attention on peritumoral regions (EVI) and mesorectal fascia margins (MFI).
Conclusion: The proposed foundation-model-driven framework leveraging frequency-domain harmonization and multi-view feature fusion achieves state-of-the-art performance in automated MRI classification of EVI and MFI, demonstrating excellent generalizability across multiple centers.
Limitations: Limitations include modest sample size, no center-specific analyses, and limited validation. Larger multi-institutional cohorts, advanced imaging, and in silico trials are needed to improve generalizability and clinical translation.
Funding for this study: Authors acknowledge financial support from ERC advanced grant (ERC-ADG-2015 n° 694812 - Hypoximmuno), ERC-2020-PoC: 957565-AUTO.DISTINCT. Authors also acknowledge financial support from the European Union’s Horizon research and innovation programme under grant agreement: CHAIMELEON n° 952172 (main contributor), ImmunoSABR n° 733008, EuCanImage n° 952103, TRANSCAN Joint Transnational Call 2016 (JTC2016 CLEARLY n° UM 2017-8295), IMI-OPTIMA n° 101034347, AIDAVA (HORIZON-HLTH-2021-TOOL-06) n°101057062, REALM (HORIZON-HLTH-2022-TOOL-11) n° 101095435, RADIOVAL (HORIZON-HLTH-2021-DISEASE-04-04) n°101057699 and EUCAIM (DIGITAL-2022-CLOUD-AI-02) n°101100633. This study was also supported by the China Scholarship Council grant (202208110055).
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The study was approved by the Ethics Committees of all participating centers: La Fe University and Polytechnic Hospital (Valencia, Spain), Unidade Local de Saúde Hospital (Portugal), and Centre Hospitalier Universitaire d’Angers (France).
6 min
Best Practices for CT Foundation Model Embeddings: Ablation Studies and Cancer Immunotherapy Outcome Prediction
Cristina Mendoza-Moreno, Barcelona / Spain
Author Block: C. Mendoza-Moreno, D. Navarro-Garcia, C. Zatse, A. Marcos Morales, O. Llorian-Salvador, R. Perez Lopez; Barcelona/ES
Purpose: Foundation models (FMs) offer powerful representations of medical imaging data, with the potential to improve performance in downstream tasks. Yet, it is unclear how their performance is affected by common machine learning (ML) pipeline choices. This study evaluates how design choices in data handling and model construction influence the predictive power of FM-derived features, benchmarked against hand-crafted radiomics.
Methods or Background: We analyzed the pre-treatment CT scans from 593 immunotherapy-treated cancer patients. Using a 10-iteration, 5-fold nested cross-validation framework, we conducted ablation studies to compare FM embeddings with hand-crafted radiomics across three endpoints: (a) clinical benefit, (b) lesion growth, and (c) lesion location. We examined the influence of five critical ML pipeline variables on predictive performance (AUROC): (1) class imbalance correction, (2) sample size variation, (3) feature normalization, (4) feature selection optimization, and (5) classifier–feature selector combinations.
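The 10-iteration, 5-fold cross-validation scaffold described above can be sketched by generating the index splits. This is a hypothetical NumPy sketch of the resampling structure only (the inner hyperparameter-selection loop and the classifiers are omitted); it is not the authors' code.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Random partition of n sample indices into k folds."""
    return np.array_split(rng.permutation(n), k)

def repeated_cv_splits(n, outer_k=5, n_iter=10, seed=0):
    """Yield (iteration, fold, train_idx, test_idx) for every outer split
    of a repeated k-fold scheme: 10 iterations x 5 folds = 50 evaluations."""
    for it in range(n_iter):
        rng = np.random.default_rng(seed + it)   # fresh shuffle each iteration
        folds = kfold_indices(n, outer_k, rng)
        for f, test_idx in enumerate(folds):
            train_idx = np.concatenate(
                [folds[j] for j in range(outer_k) if j != f])
            yield it, f, train_idx, test_idx
```

In a nested setup, each `train_idx` would be split again internally to tune the feature selector and classifier before scoring on the held-out `test_idx`, so the reported AUROC never sees data used for tuning.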
Results or Findings: FMs consistently outperformed radiomics, particularly FMCIB embeddings (AUROC up to 0.66 vs. 0.59 for radiomics) for clinical benefit prediction. Standard preprocessing, such as feature normalization, improved performance with hand-crafted radiomics but had no impact on FM embeddings. Performance did not uniformly degrade with reduced sample size, suggesting robustness to limited data in some tasks. Optimal feature number and classifier–selector combinations varied by endpoint. No universal “best” pipeline existed, and performance was highly context-dependent.
Conclusion: FM embeddings demonstrate superior performance compared to hand-crafted radiomics but require tailored pipelines. Moreover, both the prediction endpoint and the choice of FM embeddings significantly influence performance, underscoring the need for FM-specific workflows rather than one-size-fits-all solutions.
Limitations: This was a single-center study with limited external validation. Binary outcome definitions (e.g., clinical benefit, tumor growth) may oversimplify treatment response, and FM-specific architectural differences were not fully disentangled.
Funding for this study: Not applicable.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This retrospective study was approved by the institutional review board (PR(AG)70/2018). Informed consent was obtained from all patients participating in the clinical trials. The requirement for additional consent for computational image analysis was waived.
6 min
Non-invasive Prediction of MGMT Promoter Methylation in Glioblastoma from Routine MRI Using Foundation Models
Zohaib Salahuddin, Maastricht / Netherlands
Author Block: Z. Salahuddin, C. Cortenraede, S. Amirrajab, H. C. Woodruff, P. Lambin; Maastricht/NL
Purpose: MGMT promoter methylation informs prognosis and treatment response in glioblastoma (GBM), but it is typically assessed invasively. We evaluated whether open-source foundation models can non-invasively predict MGMT methylation status from preoperative MRI, and explored model interpretability and fairness.
Methods or Background: We used multi-sequence MRI (T1, cT1, T2, FLAIR) from the UCSF PDGM and UPenn GBM datasets, restricted to WHO grade 4 GBM. After standardized tumor-centric preprocessing, we obtained 637 patients (70% training, 10% validation, 20% testing) for MGMT methylation status prediction. We benchmarked supervised, self-supervised, and multi-task foundation models (FMCIB, UMedPT, BrainIAC, Models Genesis, Med3D) under three training regimes: (1) frozen extractor + logistic regression, (2) frozen extractor + MLP, (3) full fine-tuning. Input fusion strategies (early/late) and augmentation ablations were tested. The primary metric was ROC-AUC.
Results or Findings: For MGMT status prediction, Models Genesis fine-tuned on FLAIR achieved a ROC-AUC of 0.73, exceeding previously reported MRI-only benchmarks (AUC 0.63). Logistic regression trained on UMedPT embeddings achieved an AUC of 0.69. Kaplan–Meier curves stratified by predicted MGMT labels showed separation comparable to curves stratified by true MGMT labels on the held-out test set. Counterfactual explanations indicated that the model primarily attends to peritumoral tissue rather than the enhancing core. MGMT prediction performance varied by age: AUCs were 0.74 (Q1: 17–54), 0.79 (Q2: 54–63), 0.61 (Q3: 63–71), and 0.48 (Q4: 71–94). Gender performance was balanced (Male: 0.675, Female: 0.658).
Conclusion: Foundation models enable improved, non-invasive MGMT methylation prediction from routine MRI in a clinically meaningful way, with interpretability indicating reliance on peritumoral context. While gender fairness is well maintained, age-related performance disparities warrant mitigation.
Limitations: Further work is needed to explore semantic and clinical feature fusion. Prospective validation and domain adaptation on unseen external datasets are required.
Funding for this study: Authors acknowledge financial support from ERC advanced grant (ERC-ADG-2015 n° 694812 - Hypoximmuno), ERC-2020-PoC: 957565-AUTO.DISTINCT. Authors also acknowledge financial support from the European Union’s Horizon research and innovation programme under grant agreement: CHAIMELEON n° 952172 (main contributor), ImmunoSABR n° 733008, EuCanImage n° 952103, TRANSCAN Joint Transnational Call 2016 (JTC2016 CLEARLY n° UM 2017-8295), IMI-OPTIMA n° 101034347, AIDAVA (HORIZON-HLTH-2021-TOOL-06) n°101057062, REALM (HORIZON-HLTH-2022-TOOL-11) n° 101095435, RADIOVAL (HORIZON-HLTH-2021-DISEASE-04-04) n°101057699 and EUCAIM (DIGITAL-2022-CLOUD-AI-02) n°101100633.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
A Vision Language Foundation Model for MRI Harmonization Achieves High Accuracy in Modality and Anatomy Classification
Greg Zaharchuk, Stanford / United States
Author Block: D. Wang1, T. C. Arnold2, A. Shankaranarayanan1, G. Zaharchuk1; 1Menlo Park/US, 2Philadelphia, PA/US
Purpose: Magnetic Resonance Imaging (MRI) is widely used in both clinical and research settings. However, there is no standard system for naming or categorizing MRI sequences. As a result, MRI sequences often vary in appearance due to differences in imaging protocols, scanner vendors, and institutional practices. The goal of this study is to create an efficient, streamlined method to standardize MRI sequences.
Methods or Background: A vision-language model was designed with BERT and 3D ResNet-18 for text/image feature extraction (36 meta tags and 32 slices). Six transformer layers with two classification heads were then applied for modality and anatomy classification. An expert rule-based method performs final verification by checking metadata fields for sequence types better identified by rules, including MRA, SSFP, and ADC/DWI. We minimize reliance on the original Series Description tag, which is often inconsistent and potentially misleading.
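The rule-based verification step — letting metadata rules override the model for sequence types the abstract names (MRA, SSFP, ADC/DWI) — can be sketched like this. The field names and thresholds here are illustrative assumptions of mine, not the authors' actual rules or DICOM tags.

```python
def rule_based_sequence(meta):
    """Minimal sketch of expert-rule verification on DICOM-style metadata.
    Returns a sequence label when a rule fires, else None so the
    vision-language classifier's prediction stands. Field names are
    illustrative, not the authors' actual rule set."""
    image_type = " ".join(meta.get("ImageType", [])).upper()
    if "ADC" in image_type:
        return "ADC"
    if "DIFFUSION" in image_type or meta.get("DiffusionBValue", 0) > 0:
        return "DWI"
    if "ANGIO" in image_type:
        return "MRA"
    if "SSFP" in image_type:
        return "SSFP"
    return None  # fall back to the model's modality prediction
```

Ordering matters here: ADC maps are derived from diffusion series, so the ADC rule must fire before the generic diffusion check — which mirrors why the abstract singles out ADC/DWI as better identified by rules than by appearance.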
Results or Findings: Anatomies include Neuro (Brain, Neck, CSPINE, TSPINE, LSPINE) and MSK (Wrist, Hip, Elbow, Knee). Modalities include T1, T1c, T2, T2-FLAIR, DWI, MRA, T2*, SWI, SSFP, ASL, ADC, CAL, and LOC, covering nearly all common MRI sequence types. Neuro achieved 97.38% modality and 92.02% anatomy accuracy, while MSK achieved 96.7% modality and 99.5% anatomy accuracy on a comprehensive test dataset comprising 1,501 cases. Our next step is to improve Neuro anatomy accuracy.
Conclusion: The proposed method enables MRI metadata standardization, with high accuracy in MRI sequence classification. This improves the reliability and efficiency of MRI data organization, retrieval, and downstream analysis.
Limitations: Model inference is 8–10 s per series on the app side; future work will also aim to improve inference speed.
Funding for this study: None
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:
6 min
Explainable Anatomy-Guided AI for Prostate MRI: Foundation Models and In Silico Clinical Trials for Virtual Biopsy-based Risk Assessment
Zohaib Salahuddin, Maastricht / Netherlands
Author Block: Z. Salahuddin1, D. Khan1, Y. Zhang1, S. Kuang1, S. A. Mali1, H. C. Woodruff1, S. Amirrajab1, R. Cavill1, E. I. Crespo2, A. Jimenez-Pastor2, A. Galiana-Bordera2, P. Jimenez2, L. Marti-Bonmati2, P. Lambin1; 1Maastricht/NL, 2Valencia/ES
Purpose: To develop and validate a fully automated, anatomically guided deep-learning pipeline that combines foundation models with counterfactual explainability for prostate-cancer (PCa) risk stratification on routine magnetic-resonance imaging (MRI).
Methods or Background: The pipeline includes an nnU-Net module that segments the prostate and its zones on axial T2-weighted MRI, a classification module that fine-tunes the UMedPT Swin-Transformer on 3D patches with optional gland or zonal priors and clinical variables, and a VAE-GAN framework that generates counterfactuals to highlight image regions driving model decisions. Development used 1,500 PI-CAI cases for segmentation and 617 multicentre biparametric MRI exams with clinical data from the CHAIMELEON challenge for classification (70% training, 10% validation, 20% testing). Clinical utility was tested in a paired multicentre in-silico trial where 20 clinicians interpreted a 125-case test set with and without AI support after a 60-day washout.
Results or Findings: The incorporation of gland priors boosted the foundation model’s Area Under the Curve (AUC) from 0.69 to 0.72, and a three-scale ensemble (patch sizes 160–224) obtained the best test performance (AUC = 0.79), surpassing the 2024 CHAIMELEON challenge winners. Counterfactual heat-maps consistently highlighted lesion-containing regions within the segmented gland, providing intuitive, voxel-level explanations of risk predictions. In the prospective in silico trial, AI assistance increased mean diagnostic accuracy from 0.72 to 0.77 and Cohen’s κ from 0.43 to 0.53, while cutting average review time per case from 5.3 min to 3.1 min (≈40% time saving).
Conclusion: Anatomy-aware foundation models enriched with gland priors and counterfactual explanations deliver accurate, transparent and time-saving PCa risk stratification on standard MRI, supporting their integration as virtual biopsies in clinical workflows.
Limitations: Domain adaptation and prospective trials are needed to confirm robustness in real-world settings.
Funding for this study: Authors acknowledge financial support from ERC advanced grant (ERC-ADG-2015 n° 694812 - Hypoximmuno), ERC-2020-PoC: 957565-AUTO.DISTINCT. Authors also acknowledge financial support from the European Union’s Horizon research and innovation programme under grant agreement: CHAIMELEON n° 952172 (main contributor), ImmunoSABR n° 733008, EuCanImage n° 952103, TRANSCAN Joint Transnational Call 2016 (JTC2016 CLEARLY n° UM 2017-8295), IMI-OPTIMA n° 101034347, AIDAVA (HORIZON-HLTH-2021-TOOL-06) n°101057062, REALM (HORIZON-HLTH-2022-TOOL-11) n° 101095435, RADIOVAL (HORIZON-HLTH-2021-DISEASE-04-04) n°101057699 and EUCAIM (DIGITAL-2022-CLOUD-AI-02) n°101100633.
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: All datasets were retrospective and de-identified. No additional institutional review board approval was required.
6 min
Are foundation models the answer for radiology AI? Evaluating deep learning and foundation model embeddings for binary classification of pediatric foreign body aspiration on chest radiographs
Ilker Özgür Koska, Izmir / Turkey
Author Block: I. Ö. Koska, I. Genişol; Izmir/TR
Purpose: Foreign body aspiration (FBA) in children is a common emergency, yet its early detection on chest radiographs is challenging. We aimed to develop a binary classifier to distinguish FBA from chronic cough patients using chest X-rays and to evaluate the performance of both conventional transfer learning and foundation model embeddings in this context.
Methods or Background: A dataset of 251 pediatric chest X-rays was used, including 51 FBA cases and 200 chronic cough cases. A MobileNetV2 model was trained using transfer learning with weighted binary cross-entropy and focal loss to handle class imbalance. Additionally, embeddings were extracted from DINOv2 and TorchXRay foundation models and used to train classical machine learning classifiers: XGBoost, LightGBM, and Support Vector Machine (SVM). Models were evaluated using accuracy, F1-score, and area under the ROC curve (AUC).
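The second arm of the method keeps the foundation model frozen and trains only a lightweight classifier on its embeddings. A self-contained sketch of that setup, using random toy vectors as stand-ins for DINOv2/TorchXRayVision embeddings and a cosine nearest-centroid rule as a stand-in for the XGBoost/LightGBM/SVM heads:

```python
import math
import random

random.seed(0)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy "embeddings": in the study these come from a frozen foundation-model
# encoder; here we sample around two class centres to keep the sketch runnable.
def toy_embedding(centre, noise=0.3):
    return [c + random.gauss(0, noise) for c in centre]

centre_fba = [1.0] * 16
centre_cough = [-1.0] * 16

# Class sizes mirror the abstract's imbalance (51 FBA vs. 200 chronic cough).
train = ([("FBA", toy_embedding(centre_fba)) for _ in range(51)] +
         [("cough", toy_embedding(centre_cough)) for _ in range(200)])

def centroid(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

centroids = {label: centroid([e for l, e in train if l == label])
             for label in ("FBA", "cough")}

def predict(embedding):
    # Assign the class whose centroid is most similar in the embedding space.
    return max(centroids, key=lambda l: cosine(embedding, centroids[l]))

print(predict(toy_embedding(centre_fba)))  # "FBA" on this well-separated toy data
```

The point of the sketch is the division of labour: the (frozen) encoder produces fixed feature vectors, and only a small classifier is fit to the 251-case dataset, which is why class imbalance handling and classifier choice dominate performance in this regime.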
Results or Findings: The best-performing model was the MobileNetV2 transfer-learning approach with class-weighted binary cross-entropy loss and the Adam optimizer, achieving an accuracy of 0.82, a weighted-average F1-score of 0.84, and an AUC of 0.91. Among foundation models, DINOv2 embeddings performed best, with 0.79 accuracy, a 0.79 F1-score, and an AUC of 0.80.
Models built on foundation model embeddings thus demonstrated inferior performance, highlighting the limitation of directly applying large self-supervised models to radiology tasks with limited datasets.
Conclusion: In this small, imbalanced pediatric chest X-ray dataset, transfer learning with a lightweight CNN (MobileNetV2) outperformed models built on foundation model embeddings. This suggests that foundation models, while promising, may not yet provide a universal solution for radiology AI, particularly in specialized or small-scale clinical datasets.
Limitations: A small dataset and single-center design were the main limitations of this study.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: Behçet Uz Children's Hospital (İzmir) Local Ethical Committee Approval
6 min
Building a Multicenter Radiology Foundation Model with Privacy-Preserving Swarm Learning: The ODELIA Consortium Initiative
JieFu Zhu, Heidelberg / Germany
Author Block: J. Zhu, O. Lester Saldanha; Heidelberg/DE
Purpose: To develop a multicenter radiology foundation model for breast MRI screening using privacy-preserving swarm learning (SL) within the ODELIA consortium, demonstrating feasibility, performance, and clinical relevance across European institutions.
Methods or Background: Swarm learning enables decentralized training of AI models without centralizing sensitive data, addressing privacy and regulatory barriers. The ODELIA consortium connects >8 academic hospitals and research institutes across Europe, pooling an estimated >20,000 breast MRI examinations from heterogeneous scanners and protocols. We implemented an open-source SL framework (Mediswarm) built with Python, NVFlare, Docker, and cross-platform coordination. Foundation model pretraining was performed on public imaging datasets, followed by weakly supervised fine-tuning with case-level diagnostic labels. Data never leave local sites; secure aggregation, differential privacy, and active-learning loops ensure robustness and efficiency.
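At its core, swarm learning exchanges only model parameter updates between sites and combines them into a shared model, so no imaging data ever moves. A minimal sketch of one such aggregation round, with hypothetical per-site updates (the actual framework uses NVFlare with secure aggregation rather than this plain average):

```python
def average_params(node_params):
    """Element-wise average of parameter vectors contributed by each node."""
    n = len(node_params)
    return [sum(vals) / n for vals in zip(*node_params)]

# Toy local parameter updates from three hospitals; in practice these are
# full model weight tensors, and only these updates leave each site.
site_a = [0.10, 0.50, -0.20]
site_b = [0.30, 0.40, -0.10]
site_c = [0.20, 0.60, -0.30]

global_update = average_params([site_a, site_b, site_c])
print(global_update)  # ~[0.2, 0.5, -0.2], up to floating-point rounding
```

Each round, every site trains locally, contributes its update, and receives the aggregated model back; privacy-preserving elements (secure aggregation, differential privacy) wrap this exchange so individual-site contributions cannot be reconstructed.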
Results or Findings: The swarm-trained foundation model achieved robust performance across institutions. For breast cancer detection, sensitivity reached 94% and specificity 92%, surpassing locally trained models (average sensitivity 88%, specificity 85%). Cross-institutional validation confirmed superior generalizability, with consistent AUROC >0.93 across diverse MRI protocols. Communication overhead was reduced by ~30% and computational load by ~25% compared with centralized training. Training remained stable under simulated node failures, and active learning improved minority-class performance, yielding a 7% gain in detecting small (<15 mm) tumors.
Conclusion: This work represents the first multinational SL implementation in radiology, delivering a scalable, open-source framework for collaborative foundation model training. ODELIA demonstrates that radiology foundation models can achieve robust, generalizable, and privacy-preserving performance in breast MRI, with potential for direct clinical translation and future expansion to additional oncologic imaging modalities.
Limitations: Preliminary results are based on ongoing training; full evaluation on all partner datasets and prospective clinical validation are pending.
Funding for this study: Funded by the European Union Horizon programme (Grant HORIZON-HLTH-2021-CARE-05-02, ODELIA project)
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study has been approved by the local ethics committees of all participating ODELIA consortium institutions. Imaging data are anonymised and processed in compliance with GDPR and national regulations. No raw patient data are exchanged between sites; all model training is performed via privacy-preserving swarm learning.