Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis
Author Block: E. Can1, W. Uller1, K. Vogt1, F. Busch2, N. Bayerl3, A. Kader2, M. R. Makowski2, K. K. Bressem2, L. C. Adams2; 1Freiburg/DE, 2Munich/DE, 3Erlangen/DE
Purpose: To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7B and Mixtral-8x7B), in simplifying 109 interventional radiology (IR) reports.
Methods or Background: Qualitative performance was assessed on a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, and naturalness; error rates, including trust-breaking and post-therapy misconduct errors, were also recorded. Quantitative readability was assessed using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests with Bonferroni-corrected p-values were used for statistical analysis.
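For readers who wish to reproduce this type of analysis, the sketch below shows one possible way to compute the four readability metrics and Bonferroni-corrected paired t-tests in Python. The libraries (textstat, SciPy, statsmodels), variable names, and example report lists are assumptions for illustration only, not the pipeline used in this study.

```python
# Illustrative sketch (not the authors' code): readability metrics and
# Bonferroni-corrected paired t-tests as described in the Methods.
import textstat
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

def readability_scores(text: str) -> dict:
    """Return the four readability metrics for one simplified report."""
    return {
        "FRE": textstat.flesch_reading_ease(text),
        "FKGL": textstat.flesch_kincaid_grade(text),
        "SMOG": textstat.smog_index(text),
        "DCRS": textstat.dale_chall_readability_score(text),
    }

# Hypothetical lists of simplified reports, one per model, paired by source report.
reports_model_a = ["...simplified report 1...", "...simplified report 2..."]
reports_model_b = ["...simplified report 1...", "...simplified report 2..."]

fre_a = [readability_scores(t)["FRE"] for t in reports_model_a]
fre_b = [readability_scores(t)["FRE"] for t in reports_model_b]

# Paired t-test on per-report FRE scores for two models.
stat, p_value = ttest_rel(fre_a, fre_b)

# Bonferroni correction across the family of pairwise comparisons
# (in practice, one p-value per model pair and metric).
p_values = [p_value]
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(p_corrected)
```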
Results or Findings: Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus on any metric (all Bonferroni-corrected p-values: p=1), while both outperformed the other models across all five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, followed by Claude-3-Opus. All models exhibited some trust-breaking and post-therapy misconduct errors; GPT-4 Turbo and GPT-3.5 Turbo with few-shot prompting showed the lowest error rates, and Mistral-7B and Mixtral-8x7B the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus on the readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5 Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).
Conclusion: GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.
Limitations: This study relied on predefined evaluation metrics, which, while comprehensive, may not capture all aspects of patient understanding and engagement. Future research should include real-world data and a broader range of medical documents, and should incorporate patient feedback to more accurately assess the clinical utility of these models.
Funding for this study: This study did not receive any specific funding from public, commercial, or not-for-profit sectors.
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: Because the reports did not include any real patient data, institutional review board approval was not required; by avoiding real patient information, the study did not fall under the formal ethics approval processes typically required for research involving human subjects.