Social Contract AI: Aligning AI Assistants with Implicit Group Norms

We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. Additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. Overall, our preliminary results suggest that developing simulation frameworks in which AI assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.
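As a rough illustration of what "formalizing user preferences as policies" could look like in the ultimatum game, the sketch below defines a few hand-written proposer policies and picks the one closest to a set of observed offers. Everything here is an assumption made for illustration: the policy definitions, the names `play_round`, `infer_policy`, and `POLICIES`, and the nearest-match inference rule are hypothetical and are not taken from the paper, whose assistant infers policies from interactions rather than from a fixed lookup.

```python
from typing import Callable, Dict, List, Tuple

# A proposer policy maps the total amount to the share offered to the responder.
ProposerPolicy = Callable[[int], int]


def selfish_proposer(total: int) -> int:
    """Hypothetical selfish policy: offer the smallest positive amount."""
    return 1 if total > 0 else 0


def fair_proposer(total: int) -> int:
    """Hypothetical fair policy: offer half (rounded down)."""
    return total // 2


def altruistic_proposer(total: int) -> int:
    """Hypothetical altruistic policy: offer the entire amount."""
    return total


# Candidate policies considered by the toy inference step (an assumption).
POLICIES: Dict[str, ProposerPolicy] = {
    "selfish": selfish_proposer,
    "fair": fair_proposer,
    "altruistic": altruistic_proposer,
}


def play_round(total: int, offer: int, accept: bool) -> Tuple[int, int]:
    """Standard ultimatum game payoff: (proposer, responder).

    If the responder rejects the offer, both players receive nothing.
    """
    if not 0 <= offer <= total:
        raise ValueError("offer must be between 0 and total")
    return (total - offer, offer) if accept else (0, 0)


def infer_policy(observations: List[Tuple[int, int]]) -> str:
    """Return the candidate policy whose offers deviate least, in total
    absolute error, from the observed (total, offer) pairs."""

    def error(policy: ProposerPolicy) -> int:
        return sum(abs(policy(total) - offer) for total, offer in observations)

    return min(POLICIES, key=lambda name: error(POLICIES[name]))
```

For example, `infer_policy([(10, 1), (20, 2)])` selects the selfish policy, because its offers are closest to the observed ones. The amounts are unitless integers, so this toy says nothing about the currency shift (e.g., grams of medicine) or the language-policy mismatch discussed in the abstract.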
