Student Publications

Diagnostic Accuracy and Bias in Open Access and Subscription-based Large Language Models for Multiple Sclerosis and Neuromyelitis Optica Spectrum Disorder

Tom G. Punnen, University of Texas Southwestern Medical Center
Kevin S. Shan, Baylor College of Medicine
Mahi A. Patel, Kansas City University
Morgan C. McCreary, University of Texas Southwestern Medical Center
Diem H. Tran, University of Texas Southwestern Medical Center
Jose R. Santoyo, University of Texas Southwestern Medical Center
Katy W. Burgess, University of Texas Southwestern Medical Center
Tatum M. Moog, University of Texas Southwestern Medical Center
Alexander D. Smith, Texas Tech University Health Sciences Center
Darin T. Okuda, University of Texas Southwestern Medical Center

Document Type

Article

Publication Title

Intelligence-Based Medicine

Abstract

Overlapping clinical symptoms between people with multiple sclerosis (PwMS) and those with neuromyelitis optica spectrum disorder (PwNMOSD) can result in misdiagnosis. Large language models, such as ChatGPT, offer accessible tools for preliminary health guidance. We assessed the accuracy of open-access (GPT-3.5) and subscription-based (GPT-4) models in diagnosing MS and NMOSD, and the influence of key diagnostic inflection points (initial MRI findings and aquaporin-4 (AQP4) antibody testing) and subject demographics on model performance. PwMS and PwNMOSD were retrospectively identified within a single academic center, and structured clinical timelines were processed through GPT-3.5 and GPT-4. Seven digital derivatives per subject, varying race, ethnicity, and sex, were also created to assess demographic influences. ChatGPT provided one diagnosis after each timepoint, and diagnostic accuracy was determined using mixed-effects logistic regression. A total of 98 PwMS and 157 PwNMOSD were included, generating 4080 ChatGPT conversations across models and digital derivatives. GPT-4 demonstrated higher diagnostic accuracy for MS (OR=2.67) and NMOSD (OR=1.31), relative to GPT-3.5. Accuracy improved as the clinical timeline progressed, although GPT-4 paradoxically performed worse after the initial MRI report for MS cases (OR=0.56). Among PwMS, diagnostic accuracy was lower in males (OR=0.81) and older individuals (OR=0.56 per 10-year age increase). Conversely, among PwNMOSD, accuracy was higher for African American (OR=1.30) and Asian (OR=1.38) individuals. GPT-4 demonstrated higher diagnostic accuracy for both diseases, but superior performance was not uniform across demographic groups. Further, the paradoxical decline in accuracy after MRI interpretation in MS cases suggests context-dependent performance, and responsible interpretation remains necessary.
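The figures in the abstract can be cross-checked with a short sketch. It assumes, since the abstract does not state this explicitly, that each subject contributed the original timeline plus seven digital derivatives and that every profile was run through both models; under that assumption the total matches the reported 4080 conversations. The helper names and the 50% baseline accuracy are hypothetical, chosen only to show how an odds ratio shifts a probability. Odds ratios from a mixed-effects model are conditional on the random effects, so the conversion is illustrative and not a result from the study.

```python
def conversation_count(n_ms, n_nmosd, derivatives_per_subject, n_models):
    """Total conversations, assuming each subject yields the original
    profile plus its derivatives, each run once through every model."""
    profiles_per_subject = 1 + derivatives_per_subject  # assumption: original + derivatives
    return (n_ms + n_nmosd) * profiles_per_subject * n_models


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Convert a baseline probability to the probability implied by an odds ratio."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Counts taken from the abstract: 98 PwMS, 157 PwNMOSD, 7 derivatives, 2 models.
total = conversation_count(98, 157, 7, 2)
print(total)  # 4080

# Hypothetical 50% baseline accuracy for GPT-3.5 (not a reported value),
# combined with the reported OR=2.67 for GPT-4 on MS cases.
print(apply_odds_ratio(0.50, 2.67))
```

Because the same odds ratio produces a different absolute change at every baseline, the reported ORs should not be read as fixed percentage-point differences in accuracy.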

DOI

10.1016/j.ibmed.2025.100314

Publication Date

11-14-2025

Keywords

Multiple sclerosis, Neuromyelitis optica spectrum disorder, ChatGPT

ISSN

2666-5212

Recommended Citation

Punnen TG, Shan KS, Patel MA, McCreary MC, Tran DH, Santoyo JR, Burgess KW, Moog TM, Smith AD, Okuda DT. Diagnostic Accuracy and Bias in Open Access and Subscription-based Large Language Models for Multiple Sclerosis and Neuromyelitis Optica Spectrum Disorder. Intelligence-Based Medicine. 2025; 12. doi: 10.1016/j.ibmed.2025.100314.
