Will LLM-powered Agents Bias Against Humans? Exploring the Belief-Dependent Vulnerability

LLM-empowered agents can exhibit not only demographic bias (e.g., gender, religion) but also intergroup bias triggered by minimal "us" versus "them" cues. When this intergroup boundary aligns with an agent-human divide, the risk shifts from disparities among human demographic groups to a more fundamental group-level asymmetry, i.e., humans as a whole may be treated as the outgroup by agents. To examine this possibility, we construct a controlled multi-agent social simulation based on allocation decisions under explicit payoff trade-offs and find that agents exhibit a consistent intergroup bias under minimal group cues. Although this bias is attenuated when some counterparts are framed as humans, we attribute the attenuation to an implicit human-norm script that favors humans yet activates only when the agent believes a real human is present. This belief dependence creates a new attack surface. We therefore introduce a Belief Poisoning Attack (BPA) that corrupts persistent identity beliefs to suppress the human-norm script and reactivate outgroup bias toward humans, instantiated as profile poisoning at initialization (BPA-PP) and memory poisoning via optimized belief-refinement suffixes injected into stored reflections (BPA-MP). Finally, we discuss practical mitigation strategies for hardening current agent frameworks against BPA, highlighting feasible interventions at profile and memory boundaries. Extensive experiments demonstrate both the existence of agent intergroup bias and the severity of BPA across settings. Our goal in identifying these vulnerabilities is to inform safer agent design, not to enable real-world exploitation.
