Symptom-Only Localization of Brainstem Ischemia: LLM vs Neurologists in 109 DWI-Positive Cases
Author Block: N. Beste1, T. Dratsch1, J. Kottlors1, P. Floßdorf1, A-M. Konitsioti1, L. Volz1, D. Pinto Dos Santos2, M. Schönfeld1; 1Köln/DE, 2Mainz/DE
Purpose: To evaluate the diagnostic accuracy of large language models (LLMs) in localizing brainstem ischemic lesions based solely on neurological symptoms, compared with experienced neurologists.
Methods or Background: We retrospectively included 109 patients with acute brainstem ischemia confirmed on diffusion-weighted imaging (DWI). Clinical symptoms were provided to three neurologists and six LLMs (GPT-5, GPT-4, GPT-4.1, GPT-4o, o3, o3 pro), which were tasked with predicting lesion site (midbrain, pons, medulla) and laterality (left/right). Accuracy, Cohen’s κ, region-specific performance, and correlations with symptom count were analyzed.
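As an illustration of the two agreement metrics named above, the following minimal sketch computes overall accuracy and Cohen’s κ for predicted versus DWI-confirmed lesion sites. It is not the authors' analysis code: the function names and the six-case toy data are hypothetical, and the sketch scores lesion site only, whereas the study also assessed laterality.

```python
from collections import Counter


def accuracy(truth, pred):
    """Fraction of cases where the predicted label equals the reference label."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("truth and pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def cohens_kappa(truth, pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), i.e. agreement beyond chance."""
    n = len(truth)
    p_o = accuracy(truth, pred)  # observed agreement
    truth_counts = Counter(truth)
    pred_counts = Counter(pred)
    # Chance agreement from the marginal label frequencies of both raters.
    p_e = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data (NOT study data): DWI reference vs. one rater's prediction.
dwi_site = ["pons", "pons", "pons", "midbrain", "medulla", "pons"]
predicted_site = ["pons", "pons", "medulla", "pons", "medulla", "pons"]

print(f"accuracy = {accuracy(dwi_site, predicted_site):.3f}")
print(f"kappa    = {cohens_kappa(dwi_site, predicted_site):.3f}")
```

Because κ discounts agreement expected by chance, a rater who mostly predicts the dominant class (here, pons) can reach a moderate accuracy while κ stays low.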
Results or Findings: GPT-4 and GPT-4o achieved the highest overall accuracy (56.0%), outperforming GPT-5 (48.6%), GPT-4.1 (41.3%), o3 (34.9%), o3 pro (10.1%), and all neurologists (32.1–36.7%). Cohen’s κ was highest for GPT-4o (κ = 0.29). LLMs performed best in pontine strokes (GPT-4: 74.0%, GPT-4o: 68.8%), while performance in midbrain and medullary lesions was substantially lower. A weak but significant correlation between the number of symptoms and prediction accuracy was found for GPT-4 (r = 0.28, p < 0.01), GPT-5 (r = 0.26, p < 0.01), and one neurologist (r = 0.29, p < 0.01).
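The correlation between symptom count and prediction accuracy can be sketched as follows. The abstract does not state which coefficient was used, so this example assumes a Pearson correlation between the per-case symptom count and a binary correct/incorrect indicator (equivalent to a point-biserial correlation). The function name and toy values are hypothetical and are not study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data (NOT study data): symptoms per case and whether
# the rater localized the lesion correctly (1) or not (0).
symptom_counts = [1, 2, 3, 4, 5, 6]
correct = [0, 0, 1, 0, 1, 1]

r = pearson_r(symptom_counts, correct)
print(f"r = {r:.2f}")
```

A positive r here would mean that cases with more documented symptoms tend to be localized correctly more often; a significance test would be needed separately to obtain a p-value.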
Conclusion: GPT-4 and GPT-4o outperformed neurologists in localizing brainstem lesions based on clinical symptoms alone, while GPT-5 also exceeded human performance but remained less accurate than GPT-4/4o. Accuracy was modest overall, especially outside pontine strokes.
Limitations: The retrospective design, small cohort size, absence of multimodal input, high proportion of pontine strokes, and lack of external validation limit generalizability. Prospective studies with integrated imaging and reasoning-augmented models are needed.
Funding for this study: 1. GPT-4 and GPT-4o reached the highest overall accuracy (56.0%), surpassing GPT-5 (48.6%), other LLMs (10.1–41.3%), and all neurologists (32.1–36.7%) in localizing brainstem lesions based on clinical symptoms alone.
2. Agreement with imaging (Cohen’s κ) was highest for GPT-4o (κ = 0.29).
3. Performance was best in pontine strokes (GPT-4: 74.0%, GPT-4o: 68.8%), but substantially lower in midbrain and medullary lesions.
4. Weak yet significant correlations between number of symptoms and accuracy were found for GPT-4 (r = 0.28), GPT-5 (r = 0.26), and one neurologist (r = 0.29).
Has your study been approved by an ethics committee? Not applicable
Ethics committee - additional information: