Large Language Models (LLMs) have demonstrated remarkable proficiency in human interaction, yet their application in the medical field remains insufficiently explored. Prior work mainly assesses medical knowledge through examinations, which is far removed from realistic scenarios and falls short of evaluating the abilities of LLMs on clinical tasks. To advance the application of LLMs in healthcare, this paper introduces the Automated Interactive Evaluation (AIE) framework and the State-Aware Patient Simulator (SAPS), targeting the gap between traditional LLM evaluations and the nuanced demands of clinical practice. Unlike prior methods that rely on static medical knowledge assessments, AIE and SAPS provide a dynamic, realistic platform for assessing LLMs through multi-turn doctor-patient simulations. This approach more closely approximates real clinical scenarios and allows detailed analysis of LLM behavior in response to complex patient interactions. Extensive experimental validation demonstrates the effectiveness of the AIE framework, with outcomes that align well with human evaluations, underscoring its potential to revolutionize medical LLM testing for improved healthcare delivery.