Document Type
Article
Publication Title
Intelligence-Based Medicine
Abstract
Overlapping clinical symptoms between people with multiple sclerosis (PwMS) and people with neuromyelitis optica spectrum disorder (PwNMOSD) can result in misdiagnosis. Large language models, such as ChatGPT, offer accessible tools for preliminary health guidance. We assessed the accuracy of open-access (GPT-3.5) and subscription-based (GPT-4) models in diagnosing MS and NMOSD. We also assessed how model performance was influenced by key diagnostic inflection points (initial MRI findings and aquaporin-4 (AQP4) antibody testing) and by subject demographics. PwMS and PwNMOSD were retrospectively identified at a single academic center, and structured clinical timelines were processed through GPT-3.5 and GPT-4. Seven digital derivatives of each subject, varying race, ethnicity, and sex, were also created to assess demographic influences. ChatGPT provided one diagnosis after each timepoint, and diagnostic accuracy was modeled using mixed-effects logistic regression. A total of 98 PwMS and 157 PwNMOSD were included, generating 4080 ChatGPT conversations across models and digital derivatives. Relative to GPT-3.5, GPT-4 demonstrated higher diagnostic accuracy for MS (OR=2.67) and NMOSD (OR=1.31). Accuracy improved as the clinical timeline progressed, although GPT-4 paradoxically performed worse after the initial MRI report for MS cases (OR=0.56). For PwMS, diagnostic accuracy was lower in males (OR=0.81) and in older individuals (OR=0.56 per 10-year increase in age). For PwNMOSD, accuracy was higher for African American (OR=1.30) and Asian (OR=1.38) subjects. GPT-4 demonstrated higher diagnostic accuracy for both diseases, but its superior performance was not uniform across demographic groups. Further, the paradoxical decline in accuracy after MRI interpretation in MS cases suggests context-dependent performance, and responsible interpretation remains necessary.
DOI
10.1016/j.ibmed.2025.100314
Publication Date
11-14-2025
Keywords
Multiple sclerosis, Neuromyelitis optica spectrum disorder, ChatGPT
ISSN
2666-5212
Recommended Citation
Punnen TG, Shan KS, Patel MA, McCreary MC, Tran DH, Santoyo JR, Burgess KW, Moog TM, Smith AD, Okuda DT. Diagnostic Accuracy and Bias in Open Access and Subscription-based Large Language Models for Multiple Sclerosis and Neuromyelitis Optica Spectrum Disorder. Intelligence-Based Medicine. 2025; 12. doi: 10.1016/j.ibmed.2025.100314.
