Advancing complex reasoning in large language models relies on high-quality, verifiable datasets, yet human annotation remains cost-prohibitive and difficult to scale. Current synthesis paradigms face a recurring trade-off: maintaining structural validity typically restricts problem complexity, while relaxing constraints to increase difficulty frequently yields inconsistent or unsolvable instances. To address this, we propose Agentic Proposing, a framework that models problem synthesis as a goal-driven sequential decision process in which a specialized agent dynamically selects and composes modular reasoning skills. Using Multi-Granularity Policy Optimization (MGPO) and an iterative workflow of internal reflection and tool use, we develop Agentic-Proposer-4B, which generates high-precision, verifiable training trajectories across mathematics, coding, and science. Empirical results show that downstream solvers trained on agent-synthesized data significantly outperform leading baselines and exhibit robust cross-domain generalization. Notably, a 30B solver trained on only 11,000 synthesized trajectories achieves a state-of-the-art 91.6% accuracy on AIME25, rivaling frontier-scale proprietary models such as GPT-5 and indicating that a small volume of high-quality synthetic signals can effectively substitute for massive human-curated datasets.