A key challenge for the safety of advanced AI systems is the possibility that multiple simpler agents might inadvertently form a collective agent with capabilities and goals distinct from those of any individual. More generally, determining when a group of agents can be viewed as a unified collective agent is a foundational question in the study of interactions and incentives in both biological and artificial systems. We adopt a behavioral perspective in answering this question, ascribing collective agency to a group when viewing the group's joint actions as rational and goal-directed successfully predicts its behavior. We formalize this perspective on collective agency using causal games -- which are causal models of strategic, multi-agent interactions -- and causal abstraction -- which formalizes when a simple, high-level model faithfully captures a more complex, low-level model. We use this framework to solve a puzzle regarding multi-agent incentives in actor-critic models and to make quantitative assessments of the degree of collective agency exhibited by different voting mechanisms. Our framework aims to provide a foundation for theoretical and empirical work to understand, predict, and control emergent collective agents in multi-agent AI systems.