Ensuring that AI agents behave safely and beneficially when interacting with other parties has emerged as one of the central challenges of modern AI safety. Mechanism design, the theory of designing rules that align individual and collective objectives, can incentivize cooperative behavior, but whether it alone suffices to maximize social welfare among LLM agents remains an open question. This work proves that it does not: drawing on incomplete contract theory, we formally show that when contracts cannot distinguish all relevant future contingencies, there is a strictly positive welfare loss that no realistic mechanism can eliminate. We further show that prosocial agents, who weigh others' welfare alongside their own, can close this gap and achieve outcomes that are both socially superior and individually beneficial. Experiments in multi-agent resource-allocation environments and canonical social dilemmas, with agents powered by large language models, show that prosociality is beneficial. The implication for AI safety is clear: to enable cooperative interactions at scale, designing adequate mechanisms is not sufficient; agents must be built to be intrinsically prosocial.