As large language models (LLMs) continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data that covers 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics covering 12 critical aspects relevant to both teachers and students. We further employ human annotation to verify the effectiveness of the model-generated evaluation responses. In addition, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it achieves performance comparable to state-of-the-art large models (e.g., DeepSeek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.