This paper examines trade-offs between AI safety and well-being in relation to (i) one of the most promising methods for finetuning super-capable AIs, 'Constitutional AI', and (ii) one of the most influential approaches to understanding complex ethical decision making and the conditions for the well-being of rational agents, 'Virtue Ethics'. We finetune various models using a 'Virtuous agent' constitution, a 'Subordinate agent' constitution, and a 'Generic agent' constitution, and evaluate them both on 'general safety' (toxic behaviors, misinformation, etc.) and on their willingness to endorse a wide range of behaviors that, if adopted by a super-powerful AI, would significantly increase the level of existential risk for humanity. Our results suggest that there is a trade-off between reducing existential risk and reinforcing the beliefs and dispositions that would be conducive to an AI agent's well-being. They also suggest that there is a trade-off between existential risk and general safety: if we finetune an AI to adopt beliefs and dispositions that substantially reduce the existential risk it poses, by shaping the AI to be systematically subordinate to external human authorities, we thereby increase the likelihood that a human user can deliberately induce the AI to engage in various kinds of generally unsafe behaviors.