Large language models are commonly trained through multi-stage post-training: first aligned via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise previously learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests standard RLHF objectives do not guarantee robustness to future adaptation. To address this, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing it instead requires pre-fine-tuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we obtain an algorithm that adds no extra computation, and we empirically show it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning.
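As a rough illustrative sketch (using our own notation, not the paper's), the KL-bounded max-min idea described above can be read as optimizing the worst-case reward over policies within a small KL ball around the current policy, where $\epsilon$ denotes an assumed KL radius and $r$ the reward model:
$$
\max_{\theta}\;\min_{\theta' :\, \mathrm{KL}\!\left(\pi_{\theta'} \,\|\, \pi_{\theta}\right) \le \epsilon}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta'}(\cdot \mid x)}\big[\, r(x, y) \,\big].
$$
Under this reading, the inner minimization stands in for the worst-case downstream adaptation reachable within the KL neighborhood, so maximizing it encourages solutions whose reward remains stable under subsequent fine-tuning.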