How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Post-training adaptation of reasoning models to new tasks with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction and differ only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $\Omega(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $\Theta\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_\theta$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_\theta^{q+1}}\big)$; GARL has lower variance, while PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
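To make the interpolation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the loss family applies the standard Tsallis $q$-logarithm, $\log_q(p) = (p^{1-q} - 1)/(1-q)$, to the success probability; the abstract does not spell out this definition. The function names `tsallis_log` and `amplification` are hypothetical. Under that assumption, the derivative with respect to $p$ is $p^{-q}$, which matches the scalar amplification described above.

```python
import math


def tsallis_log(p: float, q: float) -> float:
    """Standard Tsallis q-logarithm (assumed form of the loss family).

    Equals p - 1 at q = 0 and the natural log at q = 1.
    """
    if abs(q - 1.0) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """Derivative of tsallis_log with respect to p: the scalar p**(-q)."""
    return p ** (-q)


if __name__ == "__main__":
    # Illustrative cold-start success probability (not a value from the paper).
    p0 = 0.01
    for q in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"q={q:.2f}  log_q(p0)={tsallis_log(p0, q):.4f}  "
              f"amplification={amplification(p0, q):.2f}")
```

At small $p_0$ the amplification grows quickly with $q$, which illustrates why members closer to the density-estimation pole put more weight on rarely solved instances.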
