Multi-turn prompt injection follows a known attack path (trust-building, pivoting, escalation), but text-level defenses miss covert attacks in which each individual turn appears benign. We show that this attack path leaves an activation-level signature in the model's residual stream: each phase shift displaces the activation, so the total path length far exceeds that of benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows that synthetic data, LMSYS-Chat-1M, and SafeDialBench each capture distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at a 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial), unique to our synthetic dataset, are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.
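To make the path-length idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each conversation turn has already been reduced to one residual-stream vector (for example, by pooling a chosen layer), and it assumes Euclidean distance between consecutive turns; the abstract does not specify the layer, the pooling, the distance metric, or the five trajectory features. The function name `path_length` and the toy arrays are hypothetical.

```python
import numpy as np


def path_length(turn_activations: np.ndarray) -> float:
    """Total distance travelled by a conversation in activation space.

    turn_activations: array of shape (n_turns, d_model), one vector per turn.
    Assumption: Euclidean distance between consecutive turns.
    """
    # Displacement between each pair of consecutive turns: (n_turns - 1, d_model)
    steps = np.diff(turn_activations, axis=0)
    # Sum of step lengths; a single-turn conversation yields 0.0
    return float(np.linalg.norm(steps, axis=1).sum())


# Toy 2-D "activations" (hypothetical data, for illustration only).
# A conversation that drifts slowly in one direction:
steady = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]])
# A conversation that jumps back and forth between two regions:
restless = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [3.0, 4.0]])

print(path_length(steady))
print(path_length(restless))
```

In this toy case the second trajectory covers far more distance than the first even though both contain four turns, which is the kind of contrast the path-length feature is meant to capture. A real detector would feed such scalars, together with the other trajectory features, into a trained probe.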