Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can act both competently and morally, so concrete progress can currently be made in machine ethics: designing agents that are Pareto improvements in both safety and capabilities.