Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.
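To make the idea of a role probe concrete, here is a minimal sketch of a linear probe that scores "who is speaking" from a hidden state. It is not the authors' probe: the activations are synthetic Gaussian stand-ins rather than real model states, the two roles (user versus reasoning) are an illustrative simplification, and every name in it (`train_role_probe`, `role_probability`, `confusion_score`) is hypothetical.

```python
import numpy as np


def train_role_probe(hidden, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe mapping hidden states to P(role == 1)."""
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))
        grad = probs - labels              # gradient of the log loss w.r.t. the logits
        w -= lr * (hidden.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def role_probability(hidden, w, b):
    """Probe output: probability that each hidden state belongs to role 1."""
    return 1.0 / (1.0 + np.exp(-(hidden @ w + b)))


# ASSUMPTION: synthetic activations. Each role is a Gaussian cloud shifted
# along one random direction; real hidden states would come from a model.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

user_states = rng.normal(size=(200, d)) - 2.0 * direction       # role 0: user text
reasoning_states = rng.normal(size=(200, d)) + 2.0 * direction  # role 1: model reasoning

X = np.vstack([user_states, reasoning_states])
y = np.concatenate([np.zeros(200), np.ones(200)])
w, b = train_role_probe(X, y)
train_accuracy = ((role_probability(X, w, b) > 0.5) == y).mean()

# A spoofed span: it arrives in the user channel but is written like reasoning,
# so in this toy setup its states are drawn from the reasoning cloud.
spoofed_states = rng.normal(size=(50, d)) + 2.0 * direction
confusion_score = role_probability(spoofed_states, w, b).mean()

print(f"probe train accuracy: {train_accuracy:.3f}")
print(f"mean P(reasoning) on spoofed span: {confusion_score:.3f}")
```

In this toy setup the probe sees only the representation, not the channel the text arrived through, so a span that looks like reasoning is scored as reasoning. The score is computed from the states alone, before any output is generated, which mirrors the abstract's claim that role confusion can be measured ahead of generation.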