Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Recent advances in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical domain remains challenging because standardized and comprehensive datasets are lacking. To address this gap, we introduce CMExam, a dataset sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluation, together with solution explanations for open-ended evaluation of model reasoning. To support in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations: disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we conducted thorough experiments with representative LLMs and QA algorithms on CMExam. GPT-4 achieved the best accuracy, 61.6%, with a weighted F1 score of 0.617. This leaves a substantial gap relative to human accuracy, which stood at 71.6%. On the explanation tasks, LLMs could generate relevant reasoning and improved after fine-tuning, but they still fell short of the desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings also offer valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.

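The two multiple-choice metrics quoted above, accuracy and weighted F1, can be illustrated with a minimal sketch. It assumes each question has a single gold option letter and that weighted F1 follows the common definition (per-class F1 averaged with weights proportional to class support); the abstract does not specify the exact implementation, and the function names and toy labels below are hypothetical, not taken from the CMExam code.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of questions where the predicted option equals the gold option."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold, pred):
    """Support-weighted average of per-class F1 (assumed definition)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    support = Counter(gold)  # number of gold occurrences per option
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / n  # n = tp + fn, and n > 0 for every gold label
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        score += f1 * n / total
    return score


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from CMExam.
    gold = ["A", "B", "C", "A", "D", "E"]
    pred = ["A", "B", "A", "A", "D", "C"]
    print(f"accuracy    = {accuracy(gold, pred):.3f}")
    print(f"weighted F1 = {weighted_f1(gold, pred):.3f}")
```

Weighting by support means frequent answer options dominate the score, which is one reason accuracy and weighted F1 can land close together, as they do for GPT-4 here (61.6% versus 0.617).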