Large language models (LLMs) are shifting from answer providers to intelligent tutors in educational settings, yet current supervised fine-tuning methods learn only surface-level teaching patterns and lack the capacity for dynamic adaptation. Recent reinforcement learning approaches address this limitation but face two critical challenges. First, they evaluate teaching effectiveness solely by whether students produce correct outputs, and thus cannot distinguish students who genuinely understand from those who merely echo teacher-provided answers during the interaction. Second, they cannot perceive students' evolving cognitive states in real time through interactive dialogue, and therefore fail to adapt teaching strategies to students' cognitive levels dynamically. We propose the Unidirectional Cognitive Optimization (UCO) method to address these challenges. UCO adopts a multi-turn interactive reinforcement learning paradigm whose core innovation lies in two synergistic reward functions: the Progress Reward captures students' cognitive advancement, evaluating whether students truly transition from confusion to comprehension, while the Scaffold Reward dynamically identifies each student's Zone of Proximal Development (ZPD), encouraging teachers to keep instruction productive within this zone. We evaluate UCO against 11 baseline models on the BigMath and MathTutorBench benchmarks. Experimental results demonstrate that our UCO model outperforms all models of equivalent scale and achieves performance comparable to advanced closed-source models. The code and data are available at https://github.com/Mind-Lab-ECNU/UCO.
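To make the two reward signals concrete, the following is a minimal sketch of one way a per-turn tutoring reward could combine a progress term with a ZPD-based scaffold term. It is not the authors' implementation; all function names, the binary ZPD check, and the weighting hyperparameter `alpha` are illustrative assumptions.

```python
# Hypothetical sketch, not the UCO implementation: combining a progress signal
# with a ZPD-based scaffold signal into a single per-turn reward.

def progress_reward(p_correct_before: float, p_correct_after: float) -> float:
    """Reward cognitive advancement: how much the student's chance of solving
    the problem independently improves after the tutor's turn."""
    return p_correct_after - p_correct_before

def scaffold_reward(task_difficulty: float, student_level: float,
                    zpd_width: float = 0.2) -> float:
    """Reward teaching that stays inside the student's Zone of Proximal
    Development: tasks slightly above the student's current level (assumed
    binary check for illustration)."""
    gap = task_difficulty - student_level
    return 1.0 if 0.0 < gap <= zpd_width else 0.0

def turn_reward(p_before: float, p_after: float, task_difficulty: float,
                student_level: float, alpha: float = 0.5) -> float:
    """Weighted combination of the two signals; alpha is an assumed hyperparameter."""
    return (alpha * progress_reward(p_before, p_after)
            + (1.0 - alpha) * scaffold_reward(task_difficulty, student_level))

# Example: the student improves from 0.3 to 0.7 on a task just above their level.
print(turn_reward(p_before=0.3, p_after=0.7, task_difficulty=0.65, student_level=0.5))  # 0.7
```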