Research Presentation Session: Artificial Intelligence and Imaging Informatics

RPS 505 - The segmentation evolution continues: innovative solutions and applications for today's quantitative imaging

March 4, 15:00 - 16:00 CET

6 min

A Hospital-Integrated Human-in-the-Loop Platform for Reproducible Musculoskeletal Segmentation and AI Deployment

John Garcia-Henao, Zurich / Switzerland

Author Block: J. Garcia-Henao, N. Bünger, B. Herzog, S. Caprara; Zurich/CH
Purpose: To evaluate the Medical Imaging Research Orchestration (MIRO) platform for integrating human-in-the-loop (HITL) annotation and AI-assisted segmentation within hospital infrastructure, with a focus on improving reproducibility and collaboration in musculoskeletal imaging.
Methods or Background: The development of accurate AI-assisted segmentation models in orthopaedics requires standardized datasets, clinical validation, and reproducible workflows. The MIRO platform was developed within the MedTwins Agil.IT Trusted Research Environment (TRE) to connect hospital PACS systems with secure research storage and high-performance computing resources.
This study used the Spine Segmentation Dataset, which includes 37 cadaveric CT scans acquired at the Balgrist Research Center and the Swiss Center for Musculoskeletal Imaging. The dataset was designed to be accessible to the scientific community and serves as a benchmark for reproducible spine segmentation.
Three segmentation models, TotalSegmentator, MedSAM2, and SegmentAnyBone, were evaluated using the MIRO platform. Radiologists and surgeons accessed the images and segmentations through the MIRO web interface, enabling direct visualization, refinement, and comparison. Segmentation accuracy was measured using Dice similarity, 95th percentile Hausdorff distance, and surface Dice coefficients.
Results or Findings: TotalSegmentator provided the most consistent delineation of vertebrae. MedSAM2 achieved high performance with guided prompts, and SegmentAnyBone produced detailed segmentation in complex regions. Integration within MIRO allowed efficient dataset management, traceability of annotations, and reproducible comparison across annotators and models.
Conclusion: The MIRO platform enables secure, collaborative, and reproducible segmentation research for musculoskeletal imaging within hospital environments.
Limitations: Preliminary findings are based on early-stage evaluations; large-scale multi-institutional validation is ongoing.
Funding for this study: This study was supported by the Digitalization Initiative of the Zurich Higher Education Institutions (DIZH) under the project MedTwins Agil.IT (2024–2026).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:

6 min

Zero-shot performance of promptable medical images segmentation model

Astha Jaiswal, Cologne / Germany

Author Block: A. Jaiswal, F. Meyer, L. Oberlinkels, R-U. Müller, N. Große Hokamp, L. Caldeira, T. Persigehl; Cologne/DE
Purpose: Robust segmentation of structures of interest such as organs, tumors is crucial for clinical applications including diagnosis, treatment planning, and disease monitoring. In this work, we performed evaluation of two public medical image segmentation models.
Methods or Background: We collected two datasets (DS1, DS2) from University Hospital Cologne. DS1 includes 102 3D MR-scans of 50 autosomal dominant polycystic kidney disease patients. DS2 consists of 40 MR-scans from 40 prostate cancer patients. We evaluated the performance of TotalSegmentator MRI (TS)[1] and prompt based model nnInteractive (NI)[2] for polycystic kidney segmentation in DS1 and prostate segmentation in DS2. NI was evaluated using a single interaction via a randomly inflated 2D bounding box around each kidney and around the prostate. Dice scores and Wilcoxon Signed-Rank Test were used to compare the performance.
Results or Findings: For kidney segmentation, TS resulted in dice score of 0.32±0.28 (0.03, 0.90). NI outperformed with a dice of 0.78±0.17 (0.09, 0.97)(p<.001). Polycystic kidneys are deformed due to multiple cysts and have different appearance compared to normal kidneys. Though nnInteractive was not trained on polycystic kidney data, it captured previously unseen objects well even with single interaction. The prostate region often did not have clear boundaries, likely confusing the NI model resulting in dice of 0.80±0.14 (0.31, 0.92). For prostate segmentation, TS outperformed NI with a dice of 0.85±0.03 (0.74, 0.91)(p<.001).
Conclusion: NI generalizes well on the new tasks and allows generating high quality segmentations with one or few interactions.
Limitations: In future, we will test NI on different datasets and with multiple interactions including different prompts.

References:
[1] D’Antonoli et. al., "Totalsegmentator mri: Robust sequence-independent segmentation of multiple anatomic structures in mri." Radiology 2025.
[2] Isensee et. al., "nninteractive: Redefining 3d promptable segmentation." arXiv preprint 2025.
Funding for this study: This work has been supported by RACOON „NUM 2.0“(FKZ: 01KX2121).
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The approval numbers are 15-323-retro, 23-1193-retro.

6 min

Federated Learning for Brain Tumor Segmentation: Privacy-Preserving Multi-Institutional AI Model Development

Elshan Abdullayev, Doha / Qatar

Author Block: E. Abdullayev; Baku/AZ
Purpose: To develop and validate a federated learning approach for automated brain tumor segmentation in MRI that maintains data privacy while leveraging multi-institutional datasets for improved model performance.
Methods or Background: This retrospective multi-center study implemented a federated learning framework across five institutions using 3,420 brain MRI scans (T2-weighted, FLAIR, T1-Gd) from patients with glioblastoma, meningioma, and metastatic lesions collected between 2019 and 2024. A 3D U-Net architecture was trained using a federated averaging algorithm, where each institution trained locally without sharing raw data. Ground truth segmentations were established by consensus of two neuroradiologists. Model performance was compared against centralized learning and single-institution models using Dice similarity coefficient (DSC), Hausdorff distance (HD95), and sensitivity/specificity metrics. Statistical significance was assessed using Wilcoxon signed-rank tests.
Results or Findings: The federated learning model achieved a mean DSC of 0.763 ± 0.124 for the whole tumor, 0.698 ± 0.156 for the tumor core, and 0.612 ± 0.189 for the enhancing tumor regions. Compared to single-institution models, federated learning showed moderate improvement in DSC (0.763 vs 0.721, p=0.023) and HD95 (8.4mm vs 11.2mm, p=0.041). Performance was slightly lower than centralized learning (DSC: 0.763 vs 0.791, p=0.031) but maintained complete data privacy. The model showed variable performance across different institutions (DSC range: 0.698-0.812), with challenges in generalizing different MRI protocols. Training convergence required 15% more iterations compared to centralized approaches due to data heterogeneity.
Conclusion: Federated learning enables the development of brain tumor segmentation models while preserving patient data privacy, though with some performance trade-offs compared to centralized approaches. Despite challenges with data heterogeneity across institutions, this method shows promise for collaborative AI development in medical imaging.
Limitations: Single time-point analysis, heterogeneous scanner types, absence of an external validation cohort, and limited tumor subtype representation.
Funding for this study: No funding received for this study.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:

6 min

Multimodal learning for automated segmentation and preoperative risk stratification of endometrial cancer via multi-sequence MRI

Xiuping Nie, Hong Kong / Hong Kong SAR China

Author Block: X. Nie¹, G. Liu², Y. Dong², X. Wang¹; ¹Hong Kong/HK, ²Shenyang/CN
Purpose: Risk stratification in endometrial cancer (EC) is essential for treatment planning, but is currently determined by postoperative pathology. This study aims to develop and validate a multimodal learning model for non-invasive, preoperative risk prediction in EC patients using multi-sequence MRI, thereby enabling more individualized surgical and adjuvant treatment strategies.
Methods or Background: This multicenter retrospective study included 623 paired axial contrast-enhanced T1-weighted (CE-T1W) and fat-suppressed T2-weighted (FS-T2W) MRI scans obtained from EC patients across three sources. A two-stage multimodal learning framework was designed: in Stage I, the model was pretrained for 3D tumor segmentation and radiomics feature extraction; in Stage II, the model was further finetuned to integrate traditional radiomics features and multiscale deep learning image features for risk stratification into low-, intermediate-, and high-risk groups according to the ESGO/ESMO/ESP 2020 guidelines. Experiments on multicenter cohorts evaluated tumor segmentation and risk stratification with standard metrics.
Results or Findings: The proposed AI model outperformed the widely used nnUNet in 3D tumor segmentation, achieving Dice scores of 0.753 ± 0.149 on CE-T1W MRI and 0.764 ± 0.129 on FS-T2W MRI in the internal validation cohort, and maintained high accuracy in two external validation cohorts. For preoperative risk stratification, the model achieved an AUC of 0.801 (95% CI: 0.712-0.880) in the internal validation cohort, with AUCs of 0.852, 0.746, and 0.799 for the high-, intermediate-, and low-risk groups, respectively. Moreover, the performance observed in external validation cohorts (AUCs of 0.740 and 0.724) demonstrated the model’s robust generalizability for risk stratification.
Conclusion: Our results highlight the model’s potential to support preoperative risk stratification in EC patients, particularly by accurately identifying high-risk groups. This may aid preoperative clinical decision-making and ultimately improve patient outcomes.
Limitations: Not applicable
Funding for this study: Not applicable
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The ethics committee notification can be found under the number XJS202506107.

6 min

Deep Learning-Based MRI Segmentation of the Uterus for Early Adenomyosis Detection Across Menstruation and Ovulation

Chiara Tappermann, Bremen / Germany

Author Block: C. Tappermann¹, M. S. May², L. Siegler², T. Rüttinger², M. Fenske², M. B. Bauer², L. Kratzsch², S. Arndt², B. Lassen-Schmidt¹; ¹Bremen/DE, ²Erlangen/DE
Purpose: The RACOON FADEN project investigates MRI-based uterine biomarkers for early adenomyosis detection, a gynaecological condition in which endometrial tissue grows into the muscular wall of the uterus, often associated with uterine enlargement, pelvic pain, and infertility.
This research includes volumetric segmentation of the myometrium (MM), junctional zone (JZ), and endometrium (EM) during menstruation and ovulation.
Within the project, tailored deep learning models are trained to partially automate this process. Their performance should be assessed for both phases.
Methods or Background: The test dataset includes 16 females from six German university hospitals. Reference segmentations were created by medical students and reviewed by radiologists using a CuraMate workflow based on predefined guidelines.
MM, JZ, and EM were segmented up to the cervical junction on T2-weighted short-axis uterine images using motion-insensitive, multi-shot TSE BLADE sequences.
Models were trained iteratively in three rounds on additional training data, incorporating more data each time (32/98/122 samples, comprising both menstruation and ovulation phases).
Results or Findings: Model performance was measured with the Dice Similarity Coefficient (DSC), resulting in DSC 0.71 (MM), 0.68 (JZ), 0.63 (EM) during menstruation and DSC 0.66 (MM), 0.65 (JZ), 0.74 (EM) during ovulation for model 1. Model 2 achieved DSC 0.76 (MM), 0.74 (JZ), 0.75 (EM) during menstruation and DSC 0.74 (MM), 0.77 (JZ), 0.87 (EM) during ovulation. Model 3 reached DSC 0.73 (MM), 0.70 (JZ), 0.75 (EM) during menstruation and DSC 0.72 (MM), 0.73 (JZ), 0.84 (EM) during ovulation.
No significant differences were observed between paired DSC values for menstruation and ovulation (Wilcoxon signed-rank test; all p > 0.1).
Conclusion: Our deep learning models reliably segment MM, JZ, and EM on T2-weighted MRI during both phases, with no significant performance differences, supporting automated assessment of early adenomyosis.
Limitations: No Limitations.
Funding for this study: Funding was provided by the Bundesministerium für Bildung und Forschung via Netzwerk Universitätsmedizin (NUM 2.0, FKZ: 01KX2121).
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The RACOON FADEN study was approved by the ethics committee of all participating university hospitals. The study protocol complies with the declaration of Helsinki.

6 min

MRIvals - TotalSegmentator vs MRSegmentator. Validation of major organs segmentation in abdominal scans. A comparative study

Georgios Lappas, Athens / Greece

Author Block: G. Lappas¹, A. Afentouli¹, N. Patlakas¹, P. Giannikopoulos¹, M. Triantafyllou², G. I. Kalaitzakis², M. Klontzas², K. Petropoulos¹; ¹Athens/GR, ²Heraklion/GR
Purpose: This study aims to validate and compare the open-source TotalSegmentator 3mm (TS3MR) and MRSegmentator (MRS) models for the segmentation of major abdominal organs in Magnetic Resonance (MR) scans.
Methods or Background: The models' segmentation capability was quantified using Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD) across four heterogeneous datasets (N≈1600 individuals & 100 scanners). Exploratory Data Analysis, including basic statistics, was performed to investigate the suitability of the datasets for benchmarking. Additionally, explainable AI (i.e., Grad-CAM maps) was utilized focusing on disentangling the black box of the model decision-making. Clinical rating of the predicted segmentations’ quality was conducted by one assistant professor of radiology, and one senior radiology resident from Greek hospital scans (N=10 patients).
Results or Findings: Overall, despite the high data variability, projected in normalized volume (0.44±0.21) and intensity (0.36±0.22), both MR models can accurately segment most abdominal organs, e.g., average DSC > 90%. MRS was more accurate across all organs compared to TS3MR as DSC scores were boosted from 4% to 28% depending on sequence, pathology and anatomical region. Poorer and less robust performance was found for the gallbladder, and pancreas with an average DSC ≈ 63% for both models. Grad-CAM maps revealed the models’ inconsistencies. The clinical evaluation findings showed high-quality and clinical acceptable segmentations for both models while fine-grained details were generated by MRS.
Conclusion: Both TotalSegmentator 3mm and MRSegmentator models can accurately segment most of the major abdominal organs in MR scans while the latter is more accurate across all organs and modalities.
Limitations: The available MR data presents lower variability (for scanning protocols and demographics) for kidneys, spleen and gallbladder. Segmentation of lower abdomen organs such as prostate is not supported by MRSegmentator even though routine examined.
Funding for this study: None
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information:

6 min

Deep Learning (DL)-Based Detection and Segmentation of Acute Pulmonary Thromboembolism (PTE) in Computed Tomography Pulmonary Angiography (CTPA): A Clinical Validation Study

Ekin Cinar, Izmir / Turkey

Author Block: M. A. Kamar¹, M. M. Baris², E. Konukoglu³, E. Cinar², N. Cakir², T. Yonka², A. Ozgen Alpaydin², N. S. Gezer²; ¹İzmir/TR, ²Izmir/TR, ³Zurich/CH
Purpose: To clinically validate a DL model for fully automated detection and segmentation of acute PTE on CTPA, and to assess the impact of imaging artifacts and contrast enhancement quality on model performance.
Methods or Background: A total of 530 CTPA examinations (157 PTE-positive, 373 PTE-negative) were retrospectively evaluated. Emboli were manually segmented slice-by-slice by experienced radiologist to establish the ground truth. Each CTPA scan is classified by the DL model as positive or negative; then voxel-level segmentation of emboli across all slices in a fully automated end-to-end manner is performed for positive scans. The model’s performance in embolus detection was evaluated. Subgroup analysis classified emboli from the main pulmonary artery to lobar branches as central, and segmental to subsegmental emboli as peripheral. False positive(FP) and false negative(FN) cases were analyzed for image artifacts and pulmonary artery attenuation (HU). Embolus volume analysis was also performed.
Results or Findings: The AI model achieved overall sensitivity of 91.7%, specificity of 92.0%, precision of 82.8%, and accuracy of 91.9%. In peripheral emboli cases (n=57), the model showed 84.2% sensitivity. In central emboli (n=100), sensitivity reached 96.0%. Among 13 FN cases, 10 were associated with technical limitations including low contrast density in the main pulmonary artery (<250HU), streak artifacts, motion artifacts, and low tube current. Similarly, 22 of 30 FP cases were linked to similar artifacts, particularly vena cava streaks and suboptimal contrast timing. Only 3 FN and 8 FP cases occurred without any identifiable artifact.
Conclusion: The DL model demonstrated high performance especially for centrally located emboli. Image artifacts and low pulmonary artery contrast enhancement significantly contributed to false results, highlighting the importance of standardized acquisition protocols for reliable AI-assisted diagnosis.
Limitations: The model’s volumetric segmentation errors were primarily in small-volume thrombi.
Funding for this study: None
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: This study was approved by the Non-Interventional Research Ethics Committee of Dokuz Eylül University Hospital

6 min

Autosegmentation of Tumor-Bearing Bladders on CT: A Comparison of U-Net Variants and an Ensemble Model

Li Chen, Beijing / China

Author Block: L. Chen¹, E. Guo¹, Z. Wu², Z. Jin¹, G. Zhang¹, H. Xue¹, H. Sun¹; ¹Beijing/CN, ²Fushun/CN
Purpose: To evaluate and compare the performance of 2D U-Net, 3D U-Net, Dual Swin Transformer U-Net (DS-TransUNet), and an Ensemble Segmentation Model (ESM) for automated segmentation of tumor-bearing bladders on CT, a critical task in computer-aided diagnosis of bladder cancer (BCa).
Methods or Background: This retrospective study included 435 BCa patients (397 internal, 38 external) with manual segmentation as ground truth. Models were trained and validated through five-fold cross-validation within the nnU-Net framework. The ESM was constructed by aggregating predictions from three individual models using majority voting. Performance metrics included the Dice similarity coefficient (DSC), average surface distance, and 95th-percentile Hausdorff distance, with subgroup analyses by gender and muscle invasion status (muscle-invasive BCa / non-muscle-invasive BCa).
Results or Findings: The ESM achieved the highest DSC in both internal (0.980±0.024) and external (0.986±0.014) cohorts, significantly outperforming DS-TransUNet (P < 0.001) and 3D U-Net in internal validation (P = 0.047), and again outperformed DS-TransUNet (P < 0.001) and showed a marginally significant advantage over 2D U-Net (P = 0.045) in external validation. 3D U-Net and 2D U-Net also showed strong performance, particularly in external validation, while the DS-TransUNet model had the worst relative segmentation efficiency. Subgroup analyses confirmed stable performance of ESM across gender and tumor invasion categories.
Conclusion: The 3D U-Net excels in segmentation accuracy, while the 2D U-Net is efficient and consistent. DS-TransUNet, though promising, needs refinement for robustness. The ESM model, integrating advantages from multiple U-Net-based architectures, demonstrated optimal accuracy and robustness, highlighting its potential clinical utility.
Limitations: First, the relatively small external validation cohort potentially limits broader applicability. Additionally, cases with significant imaging artifacts were excluded, potentially affecting real-world generalizability.
Funding for this study: This work was supported by Peking Union Medical College Hospital Talent Cultivation Program (Category D) [grant number UHB11588]; National High- Level Hospital Clinical Research Funding [grant numbers, 2022-PUMCH-A-035, 2022-PUMCH-B-069, 2022-PUMCH-A-033, and 2022-PUMCH-B-068]; the Beijing Municipal Natural Science Foundation [grant number L232133] and the CAMS Innovation Fund for Medical Sciences (CIFMS) [grant number 2024-I2M-C&T-C-004].
Has your study been approved by an ethics committee? Yes
Ethics committee - additional information: The approval of the Institutional Review Board of the affiliated hospital of Peking Union Medical College has been obtained.