Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, and that the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method, RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks while requiring as little as 15% of the steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no additional training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observing only the first 50 steps and extrapolating to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, RELEX discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
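The extrapolation procedure described above can be illustrated with a toy sketch. This is not the released RELEX implementation: it assumes that parameters are flattened into a single vector, that deltas are taken relative to the initial checkpoint, and that the rank-1 direction is found by power iteration on the uncentered delta matrix. All function names (`top_direction`, `fit_line`, `relex_extrapolate`) and the synthetic trajectory are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_direction(deltas, iters=50):
    """Unit vector spanning the dominant rank-1 subspace of the rows of
    `deltas`, found by power iteration on D^T D (assumed estimator)."""
    v = deltas[-1]
    for _ in range(iters):
        proj = [dot(d, v) for d in deltas]                    # D v
        w = [sum(p * d[j] for p, d in zip(proj, deltas))      # D^T (D v)
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ slope * xs + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def relex_extrapolate(theta0, checkpoints, steps, target_step):
    """Predict the parameters at `target_step` from a short observed prefix."""
    # Parameter deltas relative to the initial checkpoint.
    deltas = [[c - b for c, b in zip(ckpt, theta0)] for ckpt in checkpoints]
    v = top_direction(deltas)
    # Scalar coefficient of each observed delta along the rank-1 direction.
    coeffs = [dot(d, v) for d in deltas]
    # Model the coefficient as linear in the training step, then extrapolate.
    slope, intercept = fit_line(steps, coeffs)
    c_target = slope * target_step + intercept
    # Components orthogonal to v are dropped, which is the "denoising" step.
    return [b + c_target * vi for b, vi in zip(theta0, v)]


# Synthetic trajectory: linear drift along one direction plus small noise.
rng = random.Random(0)
theta0 = [0.5, -0.2, 0.1]
true_dir = [0.6, 0.8, 0.0]
steps = [10, 20, 30, 40, 50]
checkpoints = [
    [b + 0.01 * t * u + rng.gauss(0.0, 1e-3) for b, u in zip(theta0, true_dir)]
    for t in steps
]

pred = relex_extrapolate(theta0, checkpoints, steps, target_step=1000)
ideal = [b + 0.01 * 1000 * u for b, u in zip(theta0, true_dir)]
print("extrapolated:", pred)
print("noise-free:  ", ideal)
```

In this toy setting, the observed prefix covers steps 10 to 50 and the prediction targets step 1000, mirroring the 20$\times$ extrapolation horizon mentioned above. Because the predicted checkpoint is rebuilt only from the rank-1 direction, noise orthogonal to that direction does not get amplified by the extrapolation.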