We introduce CFE-Bench (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE-Bench is curated from authentic, repeatedly used university homework and exam problems, each paired with a reference solution written by the course instructor. The benchmark remains challenging for frontier models: the newly released Gemini-3.1-pro-preview achieves 59.69% overall accuracy, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving substantial room for improvement. Beyond aggregate scores, we conduct a diagnostic analysis that decomposes instructor reference solutions into structured reasoning flows. We find that although frontier models often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically contain more reasoning steps than instructor solutions, indicating lower step efficiency and a higher risk of error accumulation. Data and code are available at https://github.com/Analogy-AI/CFE_Bench.
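To make the step-level diagnosis concrete, below is a minimal sketch of how a solution might be represented as a structured reasoning flow and how the two quantities discussed above (fidelity of intermediate states and step efficiency) could be computed. The class names, schema, and matching logic are illustrative assumptions, not the benchmark's actual data format or evaluation code.

```python
# Hypothetical sketch only -- not CFE-Bench's actual schema or evaluator.
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    description: str  # what this step derives
    state: str        # the intermediate result the step produces

@dataclass
class ReasoningFlow:
    problem_id: str
    steps: list[ReasoningStep] = field(default_factory=list)

def step_efficiency(instructor: ReasoningFlow, model: ReasoningFlow) -> float:
    """Ratio of instructor steps to model steps; values below 1.0 mean
    the model used more steps than the reference solution."""
    return len(instructor.steps) / max(len(model.steps), 1)

def state_match_rate(instructor: ReasoningFlow, model: ReasoningFlow) -> float:
    """Fraction of instructor intermediate states that also appear among
    the model's intermediate states (exact match here; a real evaluator
    would need semantic matching). A crude proxy for whether the model
    derives and maintains correct intermediate results."""
    model_states = {s.state for s in model.steps}
    hits = sum(1 for s in instructor.steps if s.state in model_states)
    return hits / max(len(instructor.steps), 1)
```

Under this toy representation, a model that reaches the right final answer through many redundant steps would score low on `step_efficiency`, while a model that skips or corrupts intermediate derivations would score low on `state_match_rate`, mirroring the two failure modes described above.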