Ensuring that AI agents behave safely and beneficially when interacting with other parties has emerged as one of the central challenges of modern AI safety. Mechanism design, the theory of designing rules that align individual and collective objectives, can incentivize cooperative behavior, but whether it alone suffices to maximize social welfare among large language model (LLM) agents remains an open question. This work proves that the answer is negative: drawing on incomplete contract theory, we formally show that when contracts cannot distinguish all relevant future contingencies, there is a strictly positive welfare loss that no realistic mechanism can eliminate. We show that prosocial agents, who weigh others' welfare alongside their own, can close this gap and achieve outcomes that are both socially superior and individually beneficial. Experimentally, we show that prosociality is beneficial in multi-agent resource-allocation environments and canonical social dilemmas where the agents are powered by LLMs. The implication for AI safety is clear: to enable cooperative interactions at scale, designing adequate mechanisms is not sufficient; agents must be built to be intrinsically prosocial.