Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov decision process (CMDP) framework. However, these methods can lead to ``safety interference'', where average-based safety constraints compromise the safety of some prompts in favor of others. To address this issue, we propose \textbf{Rectified Policy Optimization (RePO)}, which replaces the average safety constraint with stricter, per-prompt safety constraints. At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts. Our experiments on Alpaca-7B demonstrate that RePO improves safety alignment and reduces safety interference compared to baseline methods. Code is available at https://github.com/pxyWaterMoon/RePO.
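As a sketch of this shift (notation ours, not taken from the abstract): a CMDP-style formulation enforces an average constraint over the prompt distribution,
\[
\mathbb{E}_{x \sim \mathcal{D}}\left[ C(x) \right] \le d,
\]
which allows high-cost prompts to be offset by low-cost ones. A per-prompt rectified penalty instead charges each violation individually, e.g.\ maximizing an objective of the form
\[
J(\pi) - \lambda\, \mathbb{E}_{x \sim \mathcal{D}}\left[ \max\bigl(0,\, C(x) - d\bigr) \right],
\]
where $C(x)$ denotes the expected safety cost of the policy's response to prompt $x$, $d$ is the safety threshold, and $\lambda > 0$ weights the penalty; prompts already satisfying $C(x) \le d$ contribute nothing, so safe prompts cannot subsidize unsafe ones.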