Safety and trustworthiness are indispensable requirements for deploying AI systems based on large language models (LLMs) in real-world applications. This paper formulates human value alignment as a language model policy optimization problem that maximizes reward under a safety constraint, and then proposes an algorithm called Stepwise Alignment for Constrained Policy Optimization (SACPO). A key idea behind SACPO, supported by theory, is that the optimal policy incorporating both reward and safety can be obtained directly from a reward-aligned policy. Building on this idea, SACPO aligns the LLM with each metric stepwise while leveraging simple yet powerful alignment algorithms such as direct preference optimization (DPO). SACPO offers several benefits, including simplicity, stability, computational efficiency, and flexibility in the choice of algorithms and datasets. Under mild assumptions, our theoretical analysis provides upper bounds on near-optimality and safety constraint violation. Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.
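The stepwise idea can be illustrated with the standard DPO objective: align the model for reward (e.g., helpfulness) against the SFT reference first, then run a second DPO pass for safety (e.g., harmlessness) using the reward-aligned policy itself as the new reference. The sketch below is conceptual, not the authors' implementation; the log-probability values and the `beta` settings are hypothetical toy numbers chosen only to show the reference-policy swap between the two steps.

```python
import math


def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """DPO loss for one preference pair (chosen y_w, rejected y_l):
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Step 1: align for reward (helpfulness). Reference = SFT policy.
# Toy sequence log-probs under the current policy and the SFT reference.
loss_reward = dpo_loss(beta=0.1,
                       logp_w=-4.0, logp_l=-5.0,       # current policy
                       ref_logp_w=-4.5, ref_logp_l=-4.5)  # SFT reference

# Step 2: align for safety (harmlessness). Crucially, the reference is now
# the reward-aligned policy produced by step 1, so the safety-aligned policy
# is obtained directly from the reward-aligned one rather than from scratch.
loss_safety = dpo_loss(beta=0.05,
                       logp_w=-3.5, logp_l=-4.5,       # current policy
                       ref_logp_w=-4.0, ref_logp_l=-4.0)  # reward-aligned ref
```

In a training loop, each call would be averaged over a preference dataset for the corresponding metric; the two steps can use different datasets and even different alignment algorithms, which is the flexibility the abstract refers to.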