Large language models (LLMs) are increasingly deployed as intelligent tutors, yet research on aligning them for special education remains absent. Recent work has applied reinforcement learning (RL) to LLM tutors, but these methods target a generic learner in a single domain (mathematics) and do not address the cognitive and communicative diversity of learners with disabilities. We introduce \emph{Special-R1}, a framework that extends pedagogical RL to special education through two components: (1) a two-dimensional adaptive system prompt that couples a difficulty-based support level with a disability-specific teaching style across five disability profiles; and (2) a persona-aware Thinking Reward whose judge rubric is conditioned on the learner's disability profile. On a persona-augmented test set of 690 multi-turn dialogues, our full model raises persona-aware Fit from 6.75 (generic baseline) to 8.40 (+1.65) and Helpfulness under the special-education (SPED) rubric from 0.720 to 0.768. It also leads on the four-component Total score (2.911, +0.064 over the runner-up) while remaining within 0.01 of the strongest variant on the out-of-domain OpenLearnLM benchmark (8.53). Ablations show that the Thinking Reward is effective only when combined with adaptive prompting, and they reveal a residual weakness on specific learning disability in mathematics, which motivates targeted multimodal extensions.