Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, which provide little training signal. We investigate a fundamental question: can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we introduce SOAR, a self-improvement framework that surfaces these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy and is rewarded according to the student's improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than in intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, bi-level meta-RL can be realized in practice: it unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform the intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity collapse that those schemes typically exhibit. Third, analysis of the generated questions reveals that structural quality and well-posedness matter more for learning progress than solution correctness. Our results suggest that generating useful stepping stones does not require a preexisting ability to solve the hard problems themselves, paving a principled path to escape reasoning plateaus without additional curated data.
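To make the grounded reward concrete, the following is a minimal toy sketch, not the SOAR implementation. It assumes a hypothetical student reduced to a single integer skill level, hypothetical integer problem difficulties, and a hypothetical `reach` parameter that limits how far above its current skill the student can learn. It only scores candidate curricula by the student's measured improvement on the hard problems; the RL update that would train the teacher on these rewards is omitted.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ToyStudent:
    """Hypothetical stand-in for the student model: one integer skill level."""
    skill: int
    reach: int = 3  # assumed: largest difficulty gap the student can learn across

    def solves(self, difficulty: int) -> bool:
        # Binary outcome, mirroring the sparse pass/fail reward in the abstract.
        return self.skill >= difficulty

    def train_on(self, curriculum: Sequence[int]) -> "ToyStudent":
        # The student only benefits from problems within reach of its skill.
        skill = self.skill
        for difficulty in sorted(curriculum):
            if difficulty <= skill + self.reach:
                skill = max(skill, difficulty)
        return replace(self, skill=skill)


def success_rate(student: ToyStudent, problems: Sequence[int]) -> float:
    """Fraction of problems the student solves."""
    return sum(student.solves(p) for p in problems) / len(problems)


def grounded_reward(student: ToyStudent, curriculum: Sequence[int],
                    hard_problems: Sequence[int]) -> float:
    """Teacher reward: measured change in the student's hard-problem success rate."""
    before = success_rate(student, hard_problems)
    after = success_rate(student.train_on(curriculum), hard_problems)
    return after - before


def score_curricula(student: ToyStudent, candidates: Dict[str, List[int]],
                    hard_problems: Sequence[int]) -> Dict[str, float]:
    """Score each candidate curriculum against a fresh copy of the same student."""
    return {name: grounded_reward(student, curriculum, hard_problems)
            for name, curriculum in candidates.items()}


# Toy setup: the student starts by solving none of the hard problems.
student = ToyStudent(skill=2)
hard_problems = [6, 8, 10]
candidates = {
    "stepping_stones": [4, 6],  # each problem is within reach of the previous one
    "too_hard": [9, 10],        # out of reach, so the student learns nothing
    "too_easy": [1, 2],         # already solvable, so no progress
}
rewards = score_curricula(student, candidates, hard_problems)
best = max(rewards, key=rewards.get)
print(rewards, best)
```

In this toy setting only the curriculum of intermediate problems earns a nonzero reward, because it is the only one that changes the student's measured success on the hard set. An intrinsic proxy such as problem difficulty alone would not separate it from the out-of-reach curriculum.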
