Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction and differ only by a scalar amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_θ$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, while Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_θ^{q+1}}\big)$; GARL has lower variance, whereas PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ performs best on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
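The shared gradient direction and the two factorizations can be written out explicitly. The sketch below assumes the standard Tsallis $q$-logarithm and treats $\log_q P_θ$ as an objective to be maximized; the sign and normalization conventions are assumptions and may differ from the full definition of $J_Q$.

$$
\begin{aligned}
\log_q x &= \frac{x^{1-q}-1}{1-q} \quad (q \neq 1), \qquad \lim_{q \to 1} \log_q x = \log x, \\
\nabla_θ \log_q P_θ &= P_θ^{-q}\, \nabla_θ P_θ && \text{(amplified RL gradient, the GARL form)} \\
&= P_θ^{1-q}\, \nabla_θ \log P_θ && \text{(attenuated likelihood gradient, the PAFT form)}.
\end{aligned}
$$

At $q{=}0$ the first form reduces to the plain success-probability gradient $\nabla_θ P_θ$, and at $q{=}1$ the second reduces to $\nabla_θ \log P_θ$. When $P_θ \ll 1$, the factor $P_θ^{-q}$ is large, which is why higher $q$ enlarges the update on instances that are rarely solved.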
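The gap between the $Ω(\frac{1}{p_0})$ and $Θ\big(\log(\frac{1}{p_0})\big)$ escape times can be illustrated on a toy problem. The following is a minimal sketch, not the setup analyzed above: it assumes a single scalar parameter $θ$ with success probability $P_θ = \mathrm{sigmoid}(θ)$, and Euler-integrates the gradient flow $\dot θ = P_θ^{-q}\, \frac{dP_θ}{dθ}$ until $P_θ$ reaches a target. The names `q_log` and `escape_time`, the target of 0.5, and the step size are all hypothetical choices made for illustration.

```python
import math


def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: (x^(1-q) - 1) / (1 - q), natural log at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def escape_time(p0: float, q: float, target: float = 0.5,
                dt: float = 1e-2, max_steps: int = 10_000_000) -> float:
    """Time for the toy gradient flow to lift P from p0 to `target`.

    Toy assumption: P = sigmoid(theta) with one scalar parameter, and
    d(theta)/dt = P^(-q) * dP/dtheta = P^(-q) * P * (1 - P).
    """
    theta = math.log(p0 / (1.0 - p0))  # logit of the initial success probability
    t = 0.0
    for _ in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        if p >= target:
            return t
        # Explicit Euler step on the amplified gradient.
        theta += dt * (p ** (-q)) * p * (1.0 - p)
        t += dt
    raise RuntimeError("target not reached within max_steps")


if __name__ == "__main__":
    for p0 in (1e-2, 1e-3):
        for q in (0.0, 0.75, 1.0):
            print(f"p0={p0:g}  q={q:g}  escape time={escape_time(p0, q):.2f}")
```

In this toy, lowering $p_0$ by a factor of ten multiplies the $q{=}0$ escape time by roughly ten, while the $q{=}1$ escape time grows only by about $\ln 10$; intermediate $q$ falls in between. The sketch says nothing about the noise-memorization side of the trade-off, which would need a model with label noise.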