You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, and that the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose RELEX (REinforcement Learning EXtrapolation), a simple and compute-efficient method that estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks while requiring as few as 15% of the steps of full RLVR training. Remarkably, RELEX can extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observing only the first 50 steps and extrapolating to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the method discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
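
The extrapolation step described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it assumes each checkpoint has been flattened into a single vector, takes the rank-1 direction to be the top right-singular vector of the stacked deltas from the base weights, and fits the projection coefficient against the training step with ordinary least squares. The function name `rank1_extrapolate` and its arguments are hypothetical; the released code may differ (for example, by operating per weight matrix).

```python
import numpy as np


def rank1_extrapolate(base, checkpoints, steps, target_step):
    """Predict weights at `target_step` from a short window of checkpoints.

    Hypothetical helper (not from the paper's code release).
    base:        (d,) flattened base-model weights
    checkpoints: sequence of (d,) flattened weights observed during training
    steps:       training step of each observed checkpoint
    target_step: step to extrapolate to (may lie far beyond the window)
    """
    base = np.asarray(base, dtype=np.float64)
    # Parameter deltas relative to the base model, one row per checkpoint: (T, d).
    deltas = np.stack([np.asarray(c, dtype=np.float64) - base for c in checkpoints])

    # Rank-1 subspace: the top right-singular vector of the delta matrix.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]

    # Magnitude of each delta's projection onto that direction.
    coeffs = deltas @ direction

    # Linear regression of projection magnitude against training step.
    slope, intercept = np.polyfit(np.asarray(steps, dtype=np.float64), coeffs, deg=1)

    # Extrapolate the magnitude and map it back to parameter space.
    # The SVD sign ambiguity cancels: flipping `direction` flips `coeffs` too.
    predicted_coeff = slope * target_step + intercept
    return base + predicted_coeff * direction


if __name__ == "__main__":
    # Synthetic trajectory: linear drift along one direction plus small noise.
    rng = np.random.default_rng(0)
    d = 64
    base = rng.normal(size=d)
    true_dir = rng.normal(size=d)
    true_dir /= np.linalg.norm(true_dir)
    steps = [10, 20, 30, 40, 50]
    ckpts = [base + 0.01 * s * true_dir + 1e-4 * rng.normal(size=d) for s in steps]

    pred = rank1_extrapolate(base, ckpts, steps, target_step=1000)
    target = base + 0.01 * 1000 * true_dir
    print("extrapolation error:", np.linalg.norm(pred - target))
```

In this toy setup the components of the deltas orthogonal to the rank-1 direction are discarded before extrapolating, which illustrates on synthetic data the kind of "denoising" the abstract refers to. Whether the same behavior holds for real RLVR checkpoints is the paper's empirical claim, not something this sketch demonstrates.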

