Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet scaling up RL is bottlenecked by the limited supply of existing verifiable data, and improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple technique for synthesizing unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This lets us leverage reasoning-rich but unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset of over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continued RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting dataset, GooseReason-Cyber, sets a new state of the art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
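The task-construction idea above can be illustrated with a minimal sketch: mask one key reasoning step in a source passage, mix the gold step with LLM-generated distractors into a multiple-choice prompt, and score the policy's answer with a binary verifiable reward. All names here (`build_mcq_task`, `verifiable_reward`, the prompt wording) are hypothetical stand-ins, not the paper's actual pipeline; in particular, the paper uses an LLM to pick the masked step and write distractors, which are passed in as plain inputs here.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQTask:
    prompt: str       # masked passage plus lettered options
    choices: list     # shuffled candidate completions
    answer: str       # gold option letter, e.g. "B"

def build_mcq_task(steps, mask_idx, distractors, seed=0):
    """Turn a reasoning passage into a multiple-choice fill-in-the-middle task.

    steps:       ordered reasoning steps extracted from the source text
    mask_idx:    index of the key step to hide (chosen by an LLM in the paper)
    distractors: plausible-but-wrong alternatives (LLM-generated in the paper)
    """
    gold = steps[mask_idx]
    masked = steps[:mask_idx] + ["[MASKED STEP]"] + steps[mask_idx + 1:]
    options = distractors + [gold]
    random.Random(seed).shuffle(options)  # deterministic shuffle for the sketch
    letters = [chr(ord("A") + i) for i in range(len(options))]
    choice_block = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    prompt = ("\n".join(masked)
              + "\n\nWhich option correctly fills the masked step?\n"
              + choice_block)
    return MCQTask(prompt, options, letters[options.index(gold)])

def verifiable_reward(task, model_answer):
    """Binary RLVR reward: exact match against the gold option letter."""
    return 1.0 if model_answer.strip().upper() == task.answer else 0.0

task = build_mcq_task(
    steps=["Let a = 2.", "Then b = a + 3 = 5.", "So 2 * b = 10."],
    mask_idx=1,
    distractors=["Then b = a + 3 = 6.", "Then b = a - 3 = -1."],
)
```

Because the reward is a pure string comparison against a known gold letter, any text corpus processed this way yields tasks that are cheap to verify at RL time, which is the property that lets unverifiable prose serve as RLVR data.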