The field of AI alignment aims to steer AI systems toward human goals, preferences, and ethical principles. Its contributions have been instrumental in improving the output quality, safety, and trustworthiness of today's AI models. This perspective article draws attention to a fundamental challenge we see in all AI alignment endeavors, which we term the "AI alignment paradox": the better we align AI models with our values, the easier we may make it for adversaries to misalign the models. We illustrate the paradox by sketching three concrete example incarnations for the case of language models, each corresponding to a distinct way in which adversaries might exploit the paradox. With AI's increasing real-world impact, it is imperative that a broad community of researchers be aware of the AI alignment paradox and work to find ways to mitigate it, in order to ensure the beneficial use of AI for the good of humanity.