As Large Language Models (LLMs) become prevalent across many fields, there is an urgent need for NLP benchmarks that cover the full body of knowledge required within an individual discipline. Many contemporary benchmarks for foundation models span a broad range of subjects, but they often omit critical subjects or fail to cover the professional knowledge that each subject requires. Because LLMs perform differently across subjects and knowledge areas, this gap leads to skewed results. To address this issue, we present psybench, the first comprehensive Chinese evaluation suite that covers all the knowledge required for psychology graduate entrance exams. psybench uses multiple-choice questions to evaluate a model's strengths and weaknesses in psychology in depth. Our findings show significant differences in performance across sections of the same subject, highlighting the risk of skewed results when the knowledge in a test set is not balanced. Notably, only the ChatGPT model reaches an average accuracy above $70\%$, indicating that there is still substantial room for improvement. We expect psybench to support thorough evaluation of foundation models' strengths and weaknesses and to assist their practical application in the field of psychology.