25.06.2026
Kapelushnik N, Kamar D, Megreli J, Reitblat O, Sella R, Bahar I, Nahum Y, Livny E
Abstract
Purpose: To evaluate the diagnostic performance and limitations of a general-purpose multimodal large language model (GPT-5.4) in detecting early keratoconus (KC) using corneal tomography images.
Methods: This retrospective study included 66 eyes: 15 with subclinical KC, 18 with forme fruste keratoconus (ffKC), and 33 normal controls. Control eyes were obtained from refractive surgery candidates with normal corneas. For each eye, a standardized four-map Pentacam tomography image was analyzed using GPT-5.4 (OpenAI) with a predefined prompt. Only images were provided, without clinical or demographic data. Diagnostic performance for detecting early KC was assessed using sensitivity, specificity, predictive values, accuracy, and receiver operating characteristic (ROC) analysis.
Results: GPT-5.4 demonstrated low sensitivity (30.3%) and high specificity (100%) for detecting early KC, with an overall accuracy of 65.2% and an area under the ROC curve of 0.65. The model showed a strong tendency to classify eyes as normal, resulting in a high rate of false-negative classifications. Multiclass analysis revealed a stage-dependent performance pattern, with excellent classification of normal corneas (100%), moderate detection of subclinical KC (53.3%), and poor detection of ffKC (11.1%), with most ffKC cases misclassified as normal.
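The binary figures above can be checked against each other with a short calculation. The sketch below is illustrative only: the confusion-matrix counts are not stated in the abstract and are inferred here from the reported percentages (8 of 15 subclinical KC and 2 of 18 ffKC eyes detected, all 33 controls classified as normal). The function name `binary_metrics` is hypothetical. The last value is the area under a ROC curve with a single operating point, which equals the mean of sensitivity and specificity; whether the authors computed their AUC this way is an assumption.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return standard diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Area under a ROC curve drawn through one operating point only.
    single_point_auc = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "single_point_auc": single_point_auc,
    }


# Counts inferred from the reported percentages (assumption, not source data).
detected_subclinical = 8   # 8/15 = 53.3%
detected_ffkc = 2          # 2/18 = 11.1%
tp = detected_subclinical + detected_ffkc  # early KC eyes flagged as KC
fn = (15 + 18) - tp                        # early KC eyes called normal
tn = 33                                    # controls called normal
fp = 0                                     # controls flagged as KC

metrics = binary_metrics(tp, fn, tn, fp)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these inferred counts the calculation reproduces the reported sensitivity, specificity, accuracy, and AUC, which suggests the stage-specific and pooled figures are mutually consistent.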
Conclusions: GPT-5.4 demonstrates high specificity but limited sensitivity for detecting early KC using image-only input and is not a reliable diagnostic or screening tool in its current form.
Cornea. 2026 Jun 17. doi: 10.1097/ICO.0000000000004230