INFUSER: Influence-Guided Self-Evolution Improves Reasoning

Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.
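As a rough illustration of why a continuous, noisy reward can be awkward for standard GRPO, the sketch below computes the usual group-relative advantage, (r - mean) / (std + eps), for two groups of sampled questions. This is the commonly used GRPO normalization, not DuGRPO itself, whose exact form the abstract does not give. The function name `grpo_advantages` and the reward values are hypothetical and chosen only for illustration.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages: center and scale rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Binary correctness rewards, as used for the solver (hypothetical values).
binary_group = [1.0, 0.0, 0.0, 1.0]

# Continuous influence-like scores whose spread is tiny and may be mostly noise
# (hypothetical values, not taken from the paper).
noisy_group = [0.0101, 0.0099, 0.0100, 0.0100]

binary_adv = grpo_advantages(binary_group)
noisy_adv = grpo_advantages(noisy_group)

print("binary:", [round(a, 3) for a in binary_adv])
print("noisy: ", [round(a, 3) for a in noisy_adv])
```

In this toy case, per-group scaling stretches differences of 0.0001 in the influence-like scores into advantages as large as those produced by a right-versus-wrong distinction. This is one plausible reading of why plain per-group normalization suits such a score poorly.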
