Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks

Teachers face increasing demands on their time, particularly in adapting mathematics curricula to meet individual student needs while maintaining cognitive rigor. This study evaluates whether AI tools can accurately classify the cognitive demand of mathematical tasks, a prerequisite for creating or adapting tasks that support student learning. We tested eleven AI tools, six general-purpose (ChatGPT, Claude, DeepSeek, Gemini, Grok, Perplexity) and five education-specific (Brisk, Coteach AI, Khanmigo, Magic School, School.AI), on their ability to categorize mathematics tasks across four levels of cognitive demand using a research-based framework. The goal was to approximate the performance teachers would achieve with straightforward prompts. On average, AI tools accurately classified cognitive demand in only 63% of cases. Education-specific tools were not more accurate than general-purpose tools, and no tool exceeded 83% accuracy. All tools struggled with tasks at the extremes of cognitive demand (Memorization and Doing Mathematics), exhibiting a systematic bias toward the middle levels (Procedures with/without Connections). The tools often gave plausible-sounding explanations likely to be persuasive to novice teachers. Error analysis of cases in which tools misclassified the broad level of cognitive demand (high vs. low) revealed that tools consistently overweighted surface textual features relative to underlying cognitive processes. Further, AI tools showed weaknesses in reasoning about the factors that make a task higher or lower in cognitive demand. Errors stemmed not from ignoring relevant dimensions, but from reasoning incorrectly about multiple aspects of a task. These findings carry implications for AI integration into teacher planning workflows and highlight the need for improved prompt engineering and tool development for educational applications.
