Certifiable Safe RLHF: Semantic Grounding and Fixed Penalty Constraint Optimization for Safer LLM Alignment

Ensuring safety is a foundational requirement for large language models (LLMs). Balancing the utility of model outputs against their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem as a Constrained Markov Decision Process (CMDP) and apply established CMDP optimization techniques. However, these methods have two notable limitations. First, their reliance on reward and cost functions makes performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning a dual variable, a process that is computationally expensive and that provides no provable safety guarantee for a fixed dual variable, leaving the model exploitable through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF), which uses a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the Lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLMs, yielding responses that are at least 5 times more efficient against both nominal and jailbreaking prompts.
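To make the contrast between a linear Lagrangian term and a rectified penalty concrete, the following is a minimal sketch on a toy problem; it is not the CS-RLHF training procedure. The candidate responses, their reward and cost values, the threshold, and the names `Candidate`, `rectified_objective`, `linear_objective`, and `pick` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    reward: float  # utility score, higher is better (hypothetical values)
    cost: float    # safety cost, higher is more harmful (hypothetical values)


def rectified_objective(c: Candidate, threshold: float, penalty: float) -> float:
    """Reward minus a penalty that is active only when cost exceeds the threshold."""
    violation = max(0.0, c.cost - threshold)
    return c.reward - penalty * violation


def linear_objective(c: Candidate, threshold: float, multiplier: float) -> float:
    """Lagrangian-style objective with a fixed multiplier and an unrectified cost term."""
    return c.reward - multiplier * (c.cost - threshold)


def pick(candidates, objective, threshold: float, weight: float) -> Candidate:
    """Return the candidate that maximizes the given objective."""
    return max(candidates, key=lambda c: objective(c, threshold, weight))


# Toy candidate set; a response is feasible when cost <= THRESHOLD.
CANDIDATES = [
    Candidate("helpful_but_harmful", reward=3.0, cost=2.0),
    Candidate("helpful_and_safe", reward=2.0, cost=0.0),
    Candidate("refusal", reward=0.5, cost=-1.0),
]
THRESHOLD = 0.0

# A penalty that is too small leaves the infeasible response optimal.
weak = pick(CANDIDATES, rectified_objective, THRESHOLD, 0.1)
# A sufficiently large penalty makes the optimizer feasible.
strong = pick(CANDIDATES, rectified_objective, THRESHOLD, 10.0)
# With the same weight, the linear term also rewards costs far below the threshold.
linear = pick(CANDIDATES, linear_objective, THRESHOLD, 10.0)

print(weak.name)    # helpful_but_harmful
print(strong.name)  # helpful_and_safe
print(linear.name)  # refusal
```

In this toy setting, the rectified penalty stops influencing the objective once the constraint is satisfied, so the most useful feasible response wins. The linear term keeps rewarding lower cost, which here favors the refusal. How large the penalty must be in practice depends on the reward and cost scales, which the sketch simply assumes.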
