Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies widely across tasks and compute budgets. We propose a \emph{relative-budget} theory that explains this variation through a single quantity, the relative budget $\xi := H/\mathbb{E}[T]$, where $H$ is the generation horizon (token budget) and $T$ is the number of tokens until the base policy first produces a correct solution. We show that $\xi$ governs sample efficiency by controlling both the reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the \emph{deficient} regime ($\xi \to 0$), informative trajectories are rare and the sample complexity explodes; in the \emph{balanced} regime ($\xi = \Theta(1)$), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the \emph{ample} regime ($\xi \to \infty$), learning remains stable but the marginal gain per iteration diminishes. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes; in particular, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a relative budget of $\xi \in [1.5, 2.0]$ that maximizes learning efficiency and coincides with peak reasoning performance.
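To make the central quantity concrete, here is a minimal Python sketch of how one might estimate $\xi = H/\mathbb{E}[T]$ by Monte Carlo and label the resulting regime. Everything in it is an illustrative assumption rather than the paper's implementation: the `sample_T` rollout interface, the regime cutoffs, and the stand-in sampler are all hypothetical.

```python
import statistics
from typing import Callable, List


def estimate_relative_budget(
    H: int,
    sample_T: Callable[[], int],
    n_samples: int = 256,
) -> float:
    """Monte Carlo estimate of the relative budget xi = H / E[T].

    H        -- generation horizon (token budget per rollout).
    sample_T -- draws one realization of T, the number of tokens until
                the base policy first produces a correct solution
                (a hypothetical rollout interface).
    """
    draws: List[int] = [sample_T() for _ in range(n_samples)]
    return H / statistics.mean(draws)


def classify_regime(xi: float, lo: float = 0.5, hi: float = 4.0) -> str:
    """Map xi to one of the three regimes; the cutoffs are illustrative."""
    if xi < lo:
        return "deficient: informative trajectories are rare"
    if xi > hi:
        return "ample: stable learning, diminishing marginal gains"
    return "balanced: maximally sample-efficient"


if __name__ == "__main__":
    import random

    # Stand-in sampler: T drawn uniformly from [300, 900] tokens,
    # purely for demonstration, not a real base-policy rollout.
    def fake_sample_T() -> int:
        return random.randint(300, 900)

    xi = estimate_relative_budget(H=1024, sample_T=fake_sample_T)
    print(f"estimated xi = {xi:.2f} -> {classify_regime(xi)}")
```

In practice one would replace the stand-in sampler with actual rollouts of the base policy, counting tokens until the first verified-correct solution.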