The emergence of Large Language Models (LLMs) in the medical domain has underscored a pressing need for standard datasets to evaluate their question-answering (QA) performance. Although several benchmark datasets for medical QA exist, they either cover common knowledge across departments or focus on a specialty other than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capability of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of the experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations to guide further improvement. Our code and data are published at https://github.com/ACMISLab/PediaBench.