Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited to complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT first uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap. It then fine-tunes the policy model with reinforcement fine-tuning methods on prompts augmented with retrieved analogous demonstrations, so that the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B, respectively. This suggests that reasoning-aware retrieval is a complementary axis of improvement, orthogonal to advances in reward design or training curricula.
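The abstract does not define average@32. A minimal sketch follows, assuming the metric means: sample 32 answers per problem, score each with a verifiable correctness check, take the fraction correct per problem, and average over problems. The function name `average_at_k` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def average_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Return mean per-problem accuracy over k samples, as a percentage.

    correct[i][j] is True if the j-th sampled answer to problem i was
    judged correct. This is an assumed reading of "average@k".
    """
    if not correct or any(len(samples) == 0 for samples in correct):
        raise ValueError("need at least one problem and one sample per problem")
    # Fraction of correct samples for each problem.
    per_problem = [sum(samples) / len(samples) for samples in correct]
    # Unweighted mean over problems, scaled to percentage points.
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy data: 2 problems with 4 samples each (the abstract uses k = 32).
toy = [
    [True, False, False, True],
    [False, False, False, True],
]
score = average_at_k(toy)
```

Under this reading, a gain of 7.1 points means that the fraction of correct samples, averaged over the benchmark, rises by 7.1 percentage points.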