There is growing interest in developing LLMs for medical diagnosis to improve diagnostic efficiency. Despite their technological potential, there is no unified and comprehensive evaluation criterion, which prevents reliable assessment of the quality and potential risks of medical LLMs and in turn hinders their application in medical treatment scenarios. In addition, current evaluations rely heavily on labor-intensive interaction with LLMs to obtain diagnostic dialogues and on human evaluation of the quality of those dialogues. To address the lack of a unified and comprehensive evaluation criterion, we first establish an evaluation criterion, termed LLM-specific Mini-CEX, based on the original Mini-CEX, to assess the diagnostic capabilities of LLMs effectively. To address the labor-intensive interaction problem, we develop a patient simulator that converses automatically with LLMs, and we use ChatGPT to evaluate the diagnostic dialogues automatically. Experimental results show that the LLM-specific Mini-CEX is adequate and necessary for evaluating medical diagnostic dialogue. Moreover, ChatGPT can replace manual evaluation on the metrics of humanistic qualities and provides reproducible, automated comparisons between different LLMs.
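The automated pipeline described above (a patient simulator conversing with an LLM, followed by an automatic judge) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the scripted patient, the stub doctor, and the toy scoring keys are hypothetical and are not the authors' implementation or the LLM-specific Mini-CEX metrics. In practice the doctor and judge callables would wrap real model calls.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]                    # {"role": "patient" | "doctor", "text": "..."}
Responder = Callable[[List[Message]], str]  # maps dialogue history to the next utterance


def simulate_dialogue(patient: Responder, doctor: Responder,
                      max_turns: int = 5) -> List[Message]:
    """Alternate patient and doctor utterances for max_turns exchanges."""
    history: List[Message] = []
    for _ in range(max_turns):
        history.append({"role": "patient", "text": patient(history)})
        history.append({"role": "doctor", "text": doctor(history)})
    return history


def evaluate_dialogue(history: List[Message],
                      judge: Callable[[str], Dict[str, int]]) -> Dict[str, int]:
    """Flatten the dialogue into a transcript and hand it to a judge."""
    transcript = "\n".join(f"{m['role']}: {m['text']}" for m in history)
    return judge(transcript)


# Hypothetical stand-ins so the sketch runs without any model or API.
def scripted_patient(history: List[Message]) -> str:
    complaints = ["I have had a cough for three days.",
                  "No fever, but my throat is sore."]
    turn = len(history) // 2  # one patient line per completed exchange
    return complaints[min(turn, len(complaints) - 1)]


def stub_doctor(history: List[Message]) -> str:
    return "Thank you. Can you tell me more about that?"


def stub_judge(transcript: str) -> Dict[str, int]:
    # Toy checks only; a real judge would prompt an LLM with the evaluation criteria.
    doctor_lines = [line for line in transcript.splitlines()
                    if line.startswith("doctor:")]
    return {"num_doctor_turns": len(doctor_lines),
            "asked_question": int(any("?" in line for line in doctor_lines))}


if __name__ == "__main__":
    dialogue = simulate_dialogue(scripted_patient, stub_doctor, max_turns=2)
    print(evaluate_dialogue(dialogue, stub_judge))
```

Passing the patient, doctor, and judge as plain callables keeps the loop independent of any particular model, so different LLMs can be swapped in and compared on the same simulated patients.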