Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, which provide little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR, a self-improvement framework that surfaces these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy and is rewarded according to the student's improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 initial success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform the intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity-collapse failure modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness matter more for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to solve the hard problems themselves, paving a principled path to escape reasoning plateaus without additional curated data.
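The bi-level loop described above can be sketched in a few lines. This is a hedged, minimal illustration of the control flow only: all class and function names here are hypothetical placeholders, and the "models" are toy stubs rather than LLMs trained with RL. The key point the sketch captures is the grounded reward, computed as the student's measured before/after improvement on the held-out hard set rather than an intrinsic proxy.

```python
# Toy sketch (not the actual SOAR implementation) of a teacher-student
# meta-RL step: the teacher proposes synthetic stepping-stone problems,
# the student trains on them, and the teacher is rewarded by the
# student's measured improvement on a held-out hard problem set.

class ToyStudent:
    """Stand-in for the student LLM copy; skill is a scalar proxy."""
    def __init__(self):
        self.skill = 0.0

    def success_rate(self, hard_set):
        # Fraction of hard problems the student would solve (toy model).
        return min(1.0, self.skill)

    def rl_finetune(self, problems):
        # Stepping-stone problems give denser reward, so skill grows;
        # in SOAR this is RL finetuning under sparse binary rewards.
        self.skill += 0.01 * len(problems)


class ToyTeacher:
    """Stand-in for the teacher LLM copy."""
    def propose(self):
        return "synthetic stepping-stone problem"  # placeholder

    def rl_update(self, reward):
        # In SOAR: a policy-gradient update from the grounded reward.
        pass


def soar_step(teacher, student, hard_set, n_problems=8):
    problems = [teacher.propose() for _ in range(n_problems)]
    before = student.success_rate(hard_set)  # measure before training
    student.rl_finetune(problems)            # student trains on synthetic set
    after = student.success_rate(hard_set)   # measure after training
    teacher.rl_update(after - before)        # grounded reward = improvement
    return after - before
```

The reward signal is entirely behavioral: the teacher never needs to solve the hard problems itself, only to propose problems whose training effect moves the student's measured success rate.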