Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which maximizes a (possibly learned) reward function using policy gradient algorithms. This work identifies a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful of inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for successful execution of RFT.
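The central claim, that a small reward standard deviation implies a vanishing expected gradient even when the expected reward is far from optimal, can be illustrated with a toy example. The sketch below is not the paper's setup: it assumes a hypothetical tabular softmax policy over three outputs for a single input, with made-up rewards and logits, and computes the exact gradient of the expected reward with respect to the logits. In this toy case the gradient norm is bounded by the reward standard deviation, so both shrink together as the policy concentrates on a wrong output.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def expected_reward_stats(logits, rewards):
    """Return (expected reward, reward std, gradient norm) for a softmax policy.

    Assumption: the policy is a plain softmax over a finite set of outputs,
    parameterized directly by its logits (a toy stand-in for a language model).
    """
    p = softmax(logits)
    mean = float(p @ rewards)
    std = float(np.sqrt(p @ (rewards - mean) ** 2))
    # Exact gradient of the expected reward w.r.t. the logits:
    # d/d(theta_k) sum_y p(y) r(y) = p_k * (r_k - mean)
    grad = p * (rewards - mean)
    return mean, std, float(np.linalg.norm(grad))

# Hypothetical rewards: only output 0 is rewarded, so the optimal expected reward is 1.
rewards = np.array([1.0, 0.0, 0.0])

for gap in [0.0, 5.0, 10.0]:
    # A larger gap concentrates the policy on the unrewarded output 1.
    logits = np.array([0.0, gap, 0.0])
    mean, std, grad_norm = expected_reward_stats(logits, rewards)
    print(f"gap={gap:4.1f}  expected reward={mean:.2e}  "
          f"reward std={std:.2e}  grad norm={grad_norm:.2e}")
```

As the gap grows, the expected reward stays near zero (far from the optimum of 1) while the reward standard deviation and the gradient norm both become very small, which is the regime the abstract describes as leading to extremely slow reward maximization.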