Students' handwritten math work provides a rich resource for diagnosing cognitive skills, as it captures intermediate reasoning beyond final answers. We investigate how well current large language models (LLMs) diagnose cognitive skills from such work. Student responses, however, vary widely, often omitting steps or offering only vague, contextually implicit evidence. Despite recent advances in LLMs' multimodal and reasoning capabilities, their performance under such conditions remains underexplored. To address this gap, we constructed MathCog, a benchmark dataset of 3,036 diagnostic verdicts across 639 student responses to 110 math problems, annotated by teachers using TIMSS-grounded cognitive-skill checklists with evidential-strength labels (Evident/Vague). Evaluating 18 LLMs, we find that (1) all models underperform (F1 < 0.5) regardless of capability, and (2) performance degrades sharply under Vague evidence. Error analysis reveals systematic patterns: models frequently misattribute Vague evidence as Evident, overinterpret minimal cues, and hallucinate nonexistent evidence. We discuss implications for evidence-aware, teacher-in-the-loop designs for LLM-based cognitive diagnosis in educational settings.