Bias in large language models (LLMs) remains a persistent challenge, often leading to stereotyping and unfair treatment of social groups. While prior work has mainly focused on individual LLMs, the emergence of multi-agent systems (MAS), in which multiple LLMs collaborate and communicate, introduces new and underexplored dynamics in how bias emerges, propagates, and is amplified. To systematically investigate these dynamics, we propose a simple evaluation framework with three agent-level metrics that quantify bias emergence, propagation, and amplification throughout multi-agent interaction. We evaluate MAS on three bias benchmarks under varying LLM backbones, social-group configurations, communication behaviors, and adversarial settings. Our results show that communication can trigger up to 70\% new bias emergence, propagate bias to more than 80\% of agents, and amplify stereotypes by more than 3$\times$. We further find that denser and competitive communication generally increases bias. Finally, we demonstrate that MAS are highly vulnerable to simple bias injection attacks and that existing defense strategies provide only limited protection. Our findings provide important insights into the fairness and robustness of multi-agent LLM systems.