A Virtuous AI is an Existential Risk

This paper examines trade-offs between AI safety and well-being with respect to (i) one of the most promising methods for finetuning super-capable AIs, 'Constitutional AI', and (ii) one of the most influential approaches to understanding complex ethical decision making and the conditions for the well-being of rational agents, 'Virtue Ethics'. We finetune various models using a 'Virtuous agent' constitution, a 'Subordinate agent' constitution, and a 'Generic agent' constitution, and evaluate them on 'general safety' (toxic behaviors, misinformation, etc.) and on their willingness to endorse a wide range of behaviors that, if adopted by a super-powerful AI, would significantly increase the level of existential risk for humanity. Our results suggest that there is a trade-off between reducing existential risk and reinforcing the beliefs and dispositions that would be conducive to an AI agent's well-being. They also suggest that there is a trade-off between existential risk and general safety: if we finetune an AI to adopt beliefs and dispositions that substantially reduce the existential risk it poses, by shaping the AI to be systematically subordinate to external human authorities, we thereby increase the likelihood that a human user can deliberately induce the AI to engage in various kinds of generally unsafe behaviors.

