METHODS: First, a set of evaluation criteria is designed based on a comprehensive literature review. Second, the candidate criteria are optimized using a Delphi method by five experts in medicine and engineering. Third, three clinical experts design a set of medical datasets for interacting with LLMs. Finally, benchmarking experiments are conducted on the datasets, and the responses generated by the LLM-based chatbots are recorded for blind evaluation by five licensed medical experts. RESULTS: The resulting evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with sixteen detailed indicators. The medical datasets include twenty-seven medical dialogues and seven case reports in Chinese. Three chatbots are evaluated: ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor PuJiang (Dr. PJ) by Shanghai Artificial Intelligence Laboratory. Experimental results show that Dr. PJ outperforms ChatGPT and ERNIE Bot in both multiple-turn medical dialogue and case report scenarios.