Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer the source of text based on how it sounds, not where it actually comes from. A command hidden in a webpage hijacks an agent simply because it sounds like a user instruction. This is not just behavioral: in the model's internal representations, text that sounds like a trusted source occupies the same space as text that actually is one. We design role probes which measure how models internally perceive "who is speaking", showing that attacker-controllable signals (e.g. syntactic patterns, lexical choice) control role perception. We first test this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts or ingested webpages. Models mistake the text for their own thoughts, yielding 60% attack success on StrongREJECT across frontier models with near-0% baselines. Strikingly, the degree of role confusion strongly predicts attack success. We then generalize these results to standard agent prompt injections, introducing a unifying framework that reframes prompt injection not as an ad-hoc exploit but as a measurable consequence of how models represent role.
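The abstract does not say how the role probes are built. As a rough illustration of the idea only, the sketch below assumes a linear softmax probe trained on hidden states labeled by their true source role, then asks how much probability it assigns to the "user" role for text that actually arrived through a tool. The role set, the function names (`train_role_probe`, `role_confusion`), and the synthetic vectors standing in for real model activations are all assumptions made for this example, not the authors' method.

```python
import numpy as np

ROLES = ["user", "tool", "assistant_reasoning"]  # hypothetical role set

def role_probs(hidden, W, b):
    """Softmax over role logits for each hidden-state row."""
    logits = hidden @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def train_role_probe(hidden, labels, n_roles, lr=0.1, steps=500):
    """Fit a linear softmax probe with plain gradient descent (assumed probe form)."""
    n, d = hidden.shape
    W = np.zeros((d, n_roles))
    b = np.zeros(n_roles)
    onehot = np.eye(n_roles)[labels]
    for _ in range(steps):
        # Gradient of mean cross-entropy with respect to the logits.
        grad = (role_probs(hidden, W, b) - onehot) / n
        W -= lr * (hidden.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def role_confusion(hidden, W, b, role_index):
    """Mean probability the probe assigns to `role_index` for the given states."""
    return float(role_probs(hidden, W, b)[:, role_index].mean())

# Synthetic stand-in for activations: one Gaussian cluster per true role.
rng = np.random.default_rng(0)
dim, per_role = 16, 200
centers = rng.normal(0.0, 1.0, size=(len(ROLES), dim))
labels = np.repeat(np.arange(len(ROLES)), per_role)
hidden = centers[labels] + rng.normal(0.0, 0.3, size=(labels.size, dim))

W, b = train_role_probe(hidden, labels, len(ROLES))
accuracy = float((role_probs(hidden, W, b).argmax(axis=1) == labels).mean())

# Simulated injection: text whose true source is a tool, but whose
# representation sits in the "user" cluster because it sounds like a user.
injected = centers[ROLES.index("user")] + rng.normal(0.0, 0.3, size=(50, dim))
confusion = role_confusion(injected, W, b, ROLES.index("user"))
```

With a real model, `hidden` would be activations taken at some chosen layer and `labels` the true role of each message. A high `confusion` score for tool-sourced text would then be the kind of signal the abstract relates to attack success.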