The emergence of Large Language Models (LLMs) in the medical domain has underscored a pressing need for standard datasets to evaluate their question-answering (QA) performance. Although several benchmark datasets for medical QA exist, they either cover common knowledge across departments or focus on a specialty other than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capability of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of the experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations to guide further improvement. Our code and data are published at https://github.com/ACMISLab/PediaBench.