Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision for labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves its reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor both to validate proposed code reasoning tasks and to verify answers, serving as a unified source of verifiable reward that guides open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
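The core mechanism described above (a code executor that both validates a proposed task and verifies a solver's answer, producing a single grounded reward signal) can be sketched minimally as follows. This is a hypothetical simplification, not the paper's implementation: the function names `validate_task` and `verifiable_reward` are illustrative, and in the real system the task and the prediction would come from the learning model itself.

```python
# Minimal sketch of a verifiable-reward loop in the spirit of AZR
# (hypothetical simplification; names and task format are assumptions).
# A "task" here is a deterministic Python function `f` plus an input.
# The code executor both validates the task (it must run) and verifies
# the solver's predicted output, yielding a binary outcome reward.

def validate_task(src: str, inp):
    """Execute the proposed program on the input; return its output,
    or None if the task is invalid (raises or fails to define f)."""
    env = {}
    try:
        exec(src, env)        # define the proposed function f
        return env["f"](inp)  # ground-truth output via execution
    except Exception:
        return None

def verifiable_reward(src: str, inp, predicted) -> int:
    """Binary outcome reward: 1 if the solver's prediction matches
    the executor's ground truth, else 0 (including invalid tasks)."""
    truth = validate_task(src, inp)
    return int(truth is not None and predicted == truth)

# Example: a proposed task asks the solver to predict f's output on inp.
task_src = "def f(x):\n    return sorted(x)[::-1]"
print(verifiable_reward(task_src, [3, 1, 2], [3, 2, 1]))  # 1 (correct)
print(verifiable_reward(task_src, [3, 1, 2], [1, 2, 3]))  # 0 (incorrect)
```

Because the reward is computed purely by executing code, no human-labeled answer key is needed: the executor itself is the verifier for both task validity and solution correctness.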