Large Language Models (LLMs) are increasingly deployed as automated tutors to address educator shortages; however, they often fail at pedagogical reasoning, frequently validating incorrect student solutions (sycophancy) or providing overly direct answers that hinder learning. We introduce Hierarchical Pedagogical Oversight (HPO), a framework that adapts structured adversarial synthesis to educational assessment. Unlike cooperative multi-agent systems that often drift toward superficial consensus, HPO enforces a dialectical separation of concerns: specialist agents first distill dialogue context, which then grounds a moderated, five-act debate between opposing pedagogical critics. We evaluate this framework on the MRBench dataset of 1,214 middle-school mathematics dialogues. Our 8B-parameter model achieves a macro F1 of 0.845, outperforming GPT-4o (0.812) by 3.3 points while using roughly 20 times fewer parameters. These results establish adversarial reasoning as a critical mechanism for deploying reliable, low-compute pedagogical oversight in resource-constrained environments.
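To make the pipeline described above concrete, the sketch below illustrates one plausible structure for HPO: a specialist agent distills the dialogue, two opposing pedagogical critics argue over a fixed number of acts, and a moderator issues the final verdict. All names here (`ContextDistiller`, `PedagogicalCritic`, `Moderator`, `call_llm`) are hypothetical illustrations under our assumptions, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of an HPO-style pipeline (not the authors' code).
from dataclasses import dataclass
from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder for a call to the underlying 8B model (assumption)."""
    return f"[model response to: {prompt[:40]}...]"


@dataclass
class DistilledContext:
    summary: str


class ContextDistiller:
    """Specialist agent that condenses the raw tutoring dialogue."""

    def distill(self, dialogue: str) -> DistilledContext:
        return DistilledContext(
            summary=call_llm(f"Summarize the pedagogical context:\n{dialogue}")
        )


class PedagogicalCritic:
    """Adversarial critic arguing either for or against the tutor's move."""

    def __init__(self, stance: str):
        self.stance = stance  # "defend" or "attack"

    def argue(self, context: DistilledContext, transcript: List[str]) -> str:
        prompt = (
            f"Stance: {self.stance} the tutor response.\n"
            f"Context: {context.summary}\nDebate so far: {transcript}"
        )
        return call_llm(prompt)


class Moderator:
    """Runs the five-act debate and issues the final pedagogical verdict."""

    def oversee(self, dialogue: str, acts: int = 5) -> str:
        context = ContextDistiller().distill(dialogue)
        defender = PedagogicalCritic("defend")
        attacker = PedagogicalCritic("attack")
        transcript: List[str] = []
        for _ in range(acts):
            transcript.append(defender.argue(context, transcript))
            transcript.append(attacker.argue(context, transcript))
        return call_llm(f"Given this debate, label the tutor response:\n{transcript}")


if __name__ == "__main__":
    print(Moderator().oversee("Student: 3/4 + 1/2 = 4/6. Tutor: Great job!"))
```

In this reading, the distillation step grounds the debate so the critics argue about a shared, compressed context rather than the full dialogue, and the moderator's verdict replaces consensus with adjudication.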