Large Language Models are increasingly deployed as educational tools, yet existing benchmarks focus on narrow skills and lack grounding in learning sciences. We introduce OpenLearnLM Benchmark, a theory-grounded framework evaluating LLMs across three dimensions derived from educational assessment theory: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels based on Bloom's taxonomy. The Knowledge domain prioritizes authentic assessment items from established benchmarks, while the Attitude domain adapts Anthropic's Alignment Faking methodology to detect behavioral inconsistency under varying monitoring conditions. Evaluation of seven frontier models reveals distinct capability profiles: Claude-Opus-4.5 excels in practical skills despite lower content knowledge, while Grok-4.1-fast leads in knowledge but shows alignment concerns. Notably, no single model dominates all dimensions, validating the necessity of multi-axis evaluation. OpenLearnLM provides an open, comprehensive framework for advancing LLM readiness in authentic educational contexts.