Assessing the performance of generative pre-trained transformers against radiologists for PI-RADS classification based on prostate mpMRI text reports
Kang-Lung Lee, Taipei / Taiwan
Author Block: K. L. Lee, D. Kessler, T. Barrett; Cambridge/UK

Purpose: Large language models, such as ChatGPT and Bard, have sparked a wave of enthusiasm for their potential applications in clinical radiology, including generating clinical interpretations of reports. This study aims to compare the classification abilities of ChatGPT, Bard, and two uroradiologists in assigning PI-RADS categories based on clinical text reports.

Methods or Background: Clinical prostate MRI text reports from 100 consecutive treatment-naïve patients undergoing mpMRI between 11.2022 and 28.12.2022 were analysed. Clinical history and concluding remarks were removed from the text reports. Two uroradiologists, with 14 and 3 years of prostate MRI reporting experience respectively, retrospectively and independently assigned PI-RADS 2.1 categories to the edited text reports. The same reports were manually input into the online ChatGPT-3.5 and Bard platforms to generate PI-RADS classifications (without prior training). The original report classifications were considered definitive and were compared against the classifications of the two radiologists, ChatGPT, and Bard. Agreement rates and kappa (κ) scores were analysed.
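For illustration only, the sketch below shows one way the agreement rate and kappa score between the original reports and a given reader could be computed; it assumes unweighted Cohen's kappa (scikit-learn's cohen_kappa_score) and uses hypothetical category lists, as the abstract does not specify the statistical implementation.

```python
# Minimal sketch (not from the study): agreement rate and Cohen's kappa
# between the original-report PI-RADS categories and one reader's categories.
# The data below are hypothetical and for illustration only.
from sklearn.metrics import cohen_kappa_score

original = [2, 2, 3, 4, 5, 2, 4, 5, 3, 2]  # reference standard (original reports)
reader   = [2, 2, 3, 4, 5, 2, 4, 4, 3, 2]  # e.g. a radiologist or an LLM

# Agreement rate: fraction of reports assigned the same PI-RADS category
agreement = sum(o == r for o, r in zip(original, reader)) / len(original)

# Cohen's kappa: agreement corrected for chance
kappa = cohen_kappa_score(original, reader)

print(f"Agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```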
Results or Findings: In the original reports, 52/100 MRIs were classified as PI-RADS 2, 9/100 as PI-RADS 3, 19/100 as PI-RADS 4, and 20/100 as PI-RADS 5. Compared with the original classifications, the senior and junior radiologists concurred on 95% and 90% of the reports, respectively, while ChatGPT and Bard each agreed on 67 of the 100 reports. Notably, Bard assigned a non-existent PI-RADS 6 category to two patients (2%). The interreader agreement (κ) between the original reports and the senior radiologist, the junior radiologist, ChatGPT, and Bard was 0.92, 0.85, 0.55, and 0.49, respectively.
Conclusion: Concordance on PI-RADS scoring was high among the radiologists; however, ChatGPT and Bard demonstrated poor performance on this text-based classification task.

Limitations: The main limitation of the study is the relatively small sample of 100 reports.

Funding for this study: No funding was provided for this study.

Has your study been approved by an ethics committee? Yes

Ethics committee - additional information: The study was approved by Cambridge University NHS Foundation Trust (reference number: 288185).