Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive, high-quality, human-curated tasks and labels, whether through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence makes supervision-heavy training paradigms increasingly unsustainable, and signs of diminishing scalability are already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real-world human sports coaching and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles: the Coach proposes tasks targeted at the Player's current capability and is rewarded according to the change in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks the Coach generates. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvements without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by +4.9 points on average overall and +5.4 points on average out-of-distribution (OOD), exceeding RENT by +1.5 points on overall accuracy and R-Zero by +4.2 points on OOD accuracy.
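The Coach-Player loop described above can be illustrated with a toy numerical simulation. This is a minimal sketch under strong simplifying assumptions, not the paper's actual implementation: `Player.skill`, `Coach.propose`, the reward rules, and all constants are hypothetical placeholders standing in for the real models and RL updates.

```python
import random

random.seed(0)

class Player:
    """Toy Player: a scalar 'skill' stands in for reasoning ability."""
    def __init__(self):
        self.skill = 0.2

    def attempt(self, difficulty):
        # Success is most likely when the task difficulty matches the skill.
        return random.random() < max(0.05, self.skill - abs(difficulty - self.skill))

    def learn(self, difficulty, solved):
        # Player reward: solving an instructive task raises skill slightly.
        if solved:
            self.skill += 0.01 * difficulty

class Coach:
    """Toy Coach: proposes tasks slightly above the Player's capability."""
    def propose(self, player_skill):
        return player_skill + 0.1

def evaluate(player, probes=200, probe_difficulty=0.5):
    # Held-out evaluation used to measure the change in Player performance
    # (the quantity that would reward a learnable Coach).
    return sum(player.attempt(probe_difficulty) for _ in range(probes)) / probes

player, coach = Player(), Coach()
before = evaluate(player)
for _ in range(500):
    d = coach.propose(player.skill)      # Coach targets current capability
    solved = player.attempt(d)
    player.learn(d, solved)              # Player rewarded for solving
    # A learnable Coach would here be rewarded by the change in
    # evaluate(player); this toy Coach has no parameters to update.
after = evaluate(player)
```

The loop is cooperative rather than adversarial: the Coach benefits only when the Player improves, so measured performance (`after` vs. `before`) rises as training proceeds.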