Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

As autonomous language model agents proliferate and form an emerging agentic web with real-world consequences, what credibility signals can one use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms to agents, moving from ``Know Your Customer'' and credit scores to ``Know Your Agent'' regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, and they presume a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility. Yet language model agents are ontologically \emph{dissociative}: each is an assemblage of mutable modules (foundation models, system prompts, tool-access policies, external memory, and, in some cases, an entire multi-agent system), any of which may change the agent's behavior. The resulting persona is fluid, vulnerable to adversarial attack, and may not internalize sanctions. Drawing on dissociative identity disorder jurisprudence, we argue that this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability, the very properties that reputation mechanisms aim to sustain, thereby collapsing trust. We conclude that identity-based, ex post, regulative, sanction-based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.
