Document Type

Article

Publication Title

Clinical Imaging

Abstract

Objective: To evaluate the accuracy of large language models (LLMs) in generating Lung-RADS scores based on lung cancer screening low-dose computed tomography radiology reports.

Material and methods: A retrospective cross-sectional analysis was performed on 242 consecutive LDCT radiology reports generated by cardiothoracic fellowship-trained radiologists at a tertiary center. LLMs evaluated included ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced. Each LLM was used to assign Lung-RADS scores based on the findings section of each report. No domain-specific fine-tuning was applied. Accuracy was determined by comparing the LLM-assigned scores to radiologist-assigned scores. Efficiency was assessed by measuring response times for each LLM.

Results: ChatGPT-4o achieved the highest accuracy (83.6 %) in assigning Lung-RADS scores compared to other models, with ChatGPT-3.5 reaching 70.1 %. Gemini and Gemini Advanced had similar accuracy (70.9 % and 65.1 %, respectively). ChatGPT-3.5 had the fastest response time (median 4 s), while ChatGPT-4o was slower (median 10 s). Higher Lung-RADS categories were associated with marginally longer completion times. ChatGPT-4o demonstrated the greatest agreement with radiologists (κ = 0.836), although it was less than the previously reported human interobserver agreement.

Conclusion: ChatGPT-4o outperformed ChatGPT-3.5, Gemini, and Gemini Advanced in Lung-RADS score assignment accuracy but did not reach the level of human experts. Despite promising results, further work is needed to integrate domain-specific training and ensure LLM reliability for clinical decision-making in lung cancer screening.

DOI

10.1016/j.clinimag.2025.110455

Publication Date

3-13-2025

Keywords

ChatGPT, Gemini, LLMs, LungRADS

ISSN

1873-4499

Share

COinS