Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast amounts of human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles: a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
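To make the reward structure concrete, below is a minimal, hypothetical Python sketch of one data-generation round. It assumes the Challenger's reward peaks when the Solver's empirical accuracy on a proposed task is 50% (the "edge of capability" described above), and that, with no ground-truth labels available, the Solver's majority-vote answer serves as the pseudo-label for grading. The model calls (`challenger_propose`, `solver_answer`) are placeholders, not the paper's actual implementation.

```python
import random
from collections import Counter

def uncertainty_reward(p_hat: float) -> float:
    """Sketch of the Challenger's reward: peaks at 1.0 when the Solver's
    empirical accuracy p_hat on a task is 1/2 (i.e., the task lies at the
    edge of the Solver's capability) and falls to 0.0 at p_hat = 0 or 1."""
    return 1.0 - 2.0 * abs(p_hat - 0.5)

# --- Placeholder model calls (hypothetical; the real system samples from LLMs) ---
def challenger_propose() -> str:
    return f"task-{random.randint(0, 9)}"   # stands in for a generated question

def solver_answer(task: str) -> str:
    return random.choice(["A", "A", "B"])   # stands in for a sampled solution

def data_generation_round(n_tasks: int = 4, n_attempts: int = 8) -> None:
    """One round of the co-evolution loop: no pre-existing labels exist, so the
    majority answer over the Solver's own samples acts as the pseudo-label."""
    for _ in range(n_tasks):
        task = challenger_propose()
        answers = [solver_answer(task) for _ in range(n_attempts)]
        pseudo_label, votes = Counter(answers).most_common(1)[0]
        p_hat = votes / n_attempts                   # Solver's empirical accuracy
        r_challenger = uncertainty_reward(p_hat)     # reward for the Challenger
        r_solver = [float(a == pseudo_label) for a in answers]  # per-attempt reward
        print(f"{task}: p_hat={p_hat:.2f}, challenger_reward={r_challenger:.2f}")

if __name__ == "__main__":
    data_generation_round()
```

Under these assumptions, the Challenger earns nothing for tasks the Solver always or never solves, which drives task difficulty toward the 50% band and yields the self-improving curriculum the abstract describes.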