Large language models (LLMs) have achieved significant success in interacting with humans. However, recent studies have revealed that these models often suffer from hallucinations, producing overconfident but incorrect judgments. This limits their application in the medical domain, where tasks demand the utmost accuracy. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations. The consultation tasks require LLMs to be aware of what they do not know, to ask patients for missing medical information, and ultimately to make diagnoses. To evaluate the performance of LLMs on these tasks, a benchmark is proposed by reformulating medical multiple-choice questions from the United States Medical Licensing Examination (USMLE), and comprehensive evaluation metrics are developed and evaluated on three constructed test sets. A medical consultation training set is further constructed to improve the consultation ability of LLMs. Experimental results show that fine-tuning with the training set can alleviate hallucinations and improve LLMs' performance on the proposed benchmark. Extensive experiments and ablation studies are conducted to validate the effectiveness and robustness of the proposed framework.