Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluation, together with solution explanations for open-ended evaluation of model reasoning. To support in-depth analysis of LLMs, we invited medical professionals to label five additional question-wise annotations: disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we conducted thorough experiments with representative LLMs and QA algorithms on CMExam. GPT-4 performed best, with an accuracy of 61.5% and a weighted F1 score of 0.616. This remains well below human accuracy, which stood at 71.6%. For explanation tasks, LLMs could generate relevant reasoning and improved after fine-tuning, but they still fall short of the desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.
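
To make the two reported multiple-choice metrics concrete, the following is a minimal sketch of how accuracy and a support-weighted F1 score can be computed from gold and predicted option letters. It is not the authors' evaluation code: the function names `accuracy` and `weighted_f1` and the toy answer lists are assumptions introduced for illustration, and the weighting follows the common convention of averaging per-label F1 by each label's frequency in the gold answers.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of questions where the predicted option equals the gold option."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold, pred):
    """Per-label F1 averaged with weights proportional to gold-label support."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    support = Counter(gold)          # how often each option is the correct answer
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        predicted = sum(1 for p in pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / total      # weight by the label's share of gold answers
    return score


# Hypothetical toy data: five questions with options A-D (not taken from CMExam).
gold = ["A", "B", "C", "A", "D"]
pred = ["A", "B", "A", "A", "C"]
print(f"accuracy={accuracy(gold, pred):.3f}  weighted_f1={weighted_f1(gold, pred):.3f}")
```

Weighted F1 can differ noticeably from accuracy when the correct options are unevenly distributed, which is one reason to report both.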
