Intelligent tutoring systems have long provided automated, immediate feedback on student work, but only when that work is presented in a tightly structured format and the problems are highly constrained; reliably assessing free-form mathematical reasoning remains challenging. We present a system that processes free-form natural-language input, handles a wide range of edge cases, and comments competently not only on the technical correctness of submitted proofs but also on issues of style and presentation. We discuss the advantages and disadvantages of various approaches to evaluating such a system, and show that, by the metrics we evaluate, the quality of the generated feedback is comparable to that produced by human experts when assessing early-undergraduate homework. We stress-test our system on a small set of more advanced and unusual questions, and report both significant gaps and encouraging successes in that more challenging setting. Our system uses large language models in a modular workflow. The workflow configuration is human-readable, editable without programming knowledge, and allows some intermediate steps to be precomputed or injected by the instructor. A version of our tool is deployed on Lambdafeedback, the Imperial mathematics homework platform, and we also report on the integration of our tool into this platform.