The field of AI alignment aims to steer AI systems toward human goals, preferences, and ethical principles. Its contributions have been instrumental in improving the output quality, safety, and trustworthiness of today's AI models. This perspective article draws attention to a fundamental challenge we see in all AI alignment endeavors, which we term the "AI alignment paradox": the better we align AI models with our values, the easier we may make it for adversaries to misalign the models. We illustrate the paradox by sketching three concrete example incarnations for the case of language models, each corresponding to a distinct way in which adversaries might exploit the paradox. With AI's increasing real-world impact, it is imperative that a broad community of researchers be aware of the AI alignment paradox and work to find ways to mitigate it, in order to ensure the beneficial use of AI for the good of humanity.