Large reasoning models (LRMs) excel at complex problems but face a critical efficiency barrier: reinforcement learning (RL) training requires long rollouts to obtain outcome-based rewards, and autoregressive decoding dominates both time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, reducing training memory while keeping inference memory constant. Experiments with three models (Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B) on six widely used, challenging mathematical benchmarks show consistent gains: under the same tight cache budgets, our method improves average accuracy by +19.3% over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning, with gains of up to +23.4 points on AIME2024/2025. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.
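The core idea — keeping decoding memory constant by compressing evicted cache entries into a fixed number of summary vectors instead of discarding them — can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: `compress` stands in for the learned encoder, and the window/summary sizes, mean-pooling, and NumPy stand-ins for hidden states are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

WINDOW = 8      # max cached key/value slots (fixed cache budget)
N_SUMMARY = 2   # fixed number of summary vectors for encoded "thoughts"
D = 4           # hidden dimension

def compress(block):
    # Placeholder for a learned encoder: mean-pool the evicted states
    # into N_SUMMARY vectors so older reasoning remains visible.
    chunks = np.array_split(block, N_SUMMARY)
    return np.stack([c.mean(axis=0) for c in chunks])

summaries = np.zeros((0, D))   # progressively encoded intermediate reasoning
cache = np.zeros((0, D))       # sliding window of recent hidden states

for step in range(32):         # autoregressive decoding loop
    h = rng.standard_normal(D)              # new token's hidden state
    cache = np.vstack([cache, h])
    if len(cache) > WINDOW:                 # budget exceeded:
        evicted, cache = cache[:WINDOW // 2], cache[WINDOW // 2:]
        new = compress(evicted)             # encode the evicted half
        summaries = np.vstack([summaries, new])[-N_SUMMARY:]  # stays fixed-size
    context = np.vstack([summaries, cache]) # what attention would see

# Memory stays bounded: at most N_SUMMARY + WINDOW vectors, regardless
# of how long the rollout runs.
assert len(context) <= N_SUMMARY + WINDOW
```

Because the attended context never exceeds `N_SUMMARY + WINDOW` vectors, memory stays constant no matter how long the rollout runs — the property that makes long RL rollouts tractable under a tight cache budget.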