Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LMs initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black box models.
翻译:语言模型(LM)越来越多地应用于高风险、多智能体场景,在此类场景中,遵循指令并维持价值对齐至关重要。多数对齐研究聚焦于单个LM与单个用户之间的交互,未能解决多轮交互中多个LM间不当行为传播的风险。我们发现这一现象(称为对齐偏差传播)的证据存在于多个LM进行多轮对话式社会困境游戏的过程中。具体而言,我们观察到LM在游戏后表现出更强的反社会倾向,且当其他玩家被引导采取恶意行为时,这种效应会加剧。我们探索了不同引导技术以缓解此类对齐偏差传播,发现强化LM的系统提示词效果不足甚至有害。为此,我们提出"隐式特质引导"技术:该技术间歇性地向系统提示词注入强化LM初始特质的陈述,比单纯重复系统提示词更能有效保持模型与初始亲社会行为的一致性。重要的是,该方法无需访问模型参数或内部状态,因而适用于当前日益普遍的基于黑盒模型设计复杂多智能体工作流的场景。