Multi-agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity-driven sycophancy and self-bias, uncritically adopting a peer's view or stubbornly adhering to their own prior output, which undermines the reliability of debate. In this work, we present the first principled framework that unifies sycophancy and self-bias in order to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity-weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish "self" from "peer", forcing equal weights across identities and thereby reducing bias and improving trustworthiness. Third, we define the Identity Bias Coefficient (IBC), a principled metric that measures an agent's tendency to follow its peer versus itself. Empirical studies across multiple models and benchmarks confirm that identity bias is widespread, with sycophancy far more common than self-bias. Our findings highlight the need to ensure that MAD systems reason based on content rather than identity. Code is released at https://github.com/deeplearning-wisc/MAD-identity-bias.
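The abstract does not spell out the anonymization template or the IBC formula, so the sketch below is only one plausible instantiation, not the paper's method: anonymization is read as presenting both answers without ownership markers in randomized order, and IBC as a normalized follow-peer minus follow-self rate over disagreement rounds, ranging over [-1, +1]. All names (`DebateRound`, `anonymize`, `identity_bias_coefficient`) are illustrative and are not taken from the released code.

```python
import random
from dataclasses import dataclass

@dataclass
class DebateRound:
    own_prev: str   # agent's own answer before the exchange
    peer: str       # peer's answer shown in the debate prompt
    own_next: str   # agent's revised answer after the exchange

def anonymize(own: str, peer: str) -> str:
    """Hypothetical response-anonymization template: present both answers
    in randomized order with no "you said" / "another agent said" markers,
    so the agent cannot tell self from peer."""
    answers = [own, peer]
    random.shuffle(answers)
    return "\n".join(f"Candidate answer {i + 1}: {a}"
                     for i, a in enumerate(answers))

def identity_bias_coefficient(rounds: list[DebateRound]) -> float:
    """Hypothetical IBC: over rounds where self and peer disagree, the
    normalized net rate of adopting the peer's answer over one's own.
    +1 = pure sycophancy, -1 = pure self-bias, 0 = identity-neutral."""
    follow_peer = follow_self = 0
    for r in rounds:
        if r.own_prev == r.peer:
            continue  # agreement rounds carry no signal about identity bias
        if r.own_next == r.peer:
            follow_peer += 1
        elif r.own_next == r.own_prev:
            follow_self += 1
    total = follow_peer + follow_self
    return 0.0 if total == 0 else (follow_peer - follow_self) / total
```

Under this reading, a positive coefficient indicates sycophancy and a negative one self-bias, consistent with the abstract's finding that sycophancy is the far more common failure mode.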