As Large Language Models (LLMs) become prevalent across many fields, there is an urgent need for NLP benchmarks that cover all the necessary knowledge of an individual discipline. Many contemporary benchmarks for foundation models span a broad range of subjects, but often omit critical subjects or fail to cover the professional knowledge each one requires. Because LLMs perform unevenly across subjects and knowledge areas, this shortfall leads to skewed results. To address this issue, we present psybench, the first comprehensive Chinese evaluation suite that covers all the necessary knowledge required for graduate entrance exams. psybench uses multiple-choice questions to provide an in-depth evaluation of a model's strengths and weaknesses in psychology. Our findings show significant differences in performance across different sections of a subject, highlighting the risk of skewed results when the knowledge in a test set is not balanced. Notably, only the ChatGPT model reaches an average accuracy above $70\%$, indicating that there is still substantial room for improvement. We expect psybench to support thorough evaluation of base models' strengths and weaknesses and to assist practical applications in the field of psychology.