IMPORTANCE How effectively different large language models (LLMs) respond in pediatric ophthalmology consultations, relative to medical students, graduate students, and practicing physicians, has not been clearly established.

OBJECTIVE To design a 100-question exam in pediatric ophthalmology, evaluate the performance of LLMs in this highly specialized setting, and compare it with the performance of medical students and physicians at different levels.

DESIGN, SETTING, AND PARTICIPANTS This survey study assessed three LLMs, namely ChatGPT (GPT-3.5), GPT-4, and PaLM2, alongside three human cohorts (medical students, postgraduate students, and attending physicians) on their ability to answer questions related to pediatric ophthalmology. Questionnaires in the form of test papers were administered to the LLMs through their web interfaces and to volunteer human participants.

MAIN OUTCOMES AND MEASURES Mean scores of LLMs and humans on 100 multiple-choice questions, as well as the answer stability, correlation, and response confidence of each LLM.

RESULTS GPT-4 performed comparably to attending physicians, while ChatGPT (GPT-3.5) and PaLM2 outperformed medical students but trailed slightly behind postgraduate students. GPT-4 also showed greater stability and confidence in its responses than ChatGPT (GPT-3.5) and PaLM2.

CONCLUSIONS AND RELEVANCE These results underscore the potential of LLMs to provide medical assistance in pediatric ophthalmology and suggest that they have considerable capacity to guide the education of medical students.