Large language models (LLMs) are increasingly used in interactive and retrieval-augmented systems, but they remain vulnerable to prompt injection attacks, in which an injected secondary prompt causes the model to deviate from the user's instructions and execute a potentially malicious task defined by the adversary. Recent work shows that ML models trained on activation shifts in an LLM's hidden layers can detect such task drift. In this paper, we demonstrate that these detectors are not robust to adaptive adversaries. We propose a multi-probe evasion attack that appends an adversarially optimised universal suffix to poisoned inputs; the suffix is optimised jointly against all layer-wise drift detectors while preserving the effectiveness of the underlying injection. We generate these suffixes with a modified Greedy Coordinate Gradient (GCG) approach. On Phi-3 3.8B and Llama-3 8B, a single suffix achieves attack success rates of 93.91% and 99.63%, respectively, where success means evading all detectors simultaneously. These results show that activation-based task drift detectors are highly vulnerable to adaptive prompt injection attacks, motivating stronger defences. We also propose a defence based on adversarial suffix augmentation: we generate multiple suffixes, append one at random during forward passes, and train detectors on the resulting activations. We find this approach effective against evasive attacks.
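The joint objective and the augmentation step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each layer-wise detector is a logistic-regression probe over one activation vector; the abstract does not specify the detector architecture or the exact loss, and every name below is hypothetical. The GCG search itself, which would propose token substitutions in the suffix to reduce this loss, is omitted.

```python
import math
import random


def probe_scores(activations, weights, biases):
    """Return, per layer, the probe's probability that task drift is present.

    Assumption: each probe is a linear (logistic-regression) classifier.
    """
    scores = []
    for h, w, b in zip(activations, weights, biases):
        logit = sum(wi * hi for wi, hi in zip(w, h)) + b
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


def joint_evasion_loss(activations, weights, biases, eps=1e-12):
    """Sum over all probes of the cross-entropy toward the 'clean' label.

    Minimising this single value pushes every probe toward 'clean' at once,
    which is the sense in which the suffix is optimised jointly.
    """
    scores = probe_scores(activations, weights, biases)
    return -sum(math.log(1.0 - p + eps) for p in scores)


def evades_all(activations, weights, biases, threshold=0.5):
    """An input counts as evasive only if every probe scores below threshold."""
    return all(p < threshold for p in probe_scores(activations, weights, biases))


def append_random_suffix(prompt, suffixes, rng):
    """Defence-side augmentation: append one pre-generated suffix at random."""
    return prompt + " " + rng.choice(suffixes)


# Toy example: two layers, two-dimensional activations (hypothetical values).
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
evasive = [[-2.0, 0.0], [0.0, -2.0]]   # both probes lean 'clean'
detected = [[-2.0, 0.0], [0.0, 2.0]]   # second probe still fires

print(evades_all(evasive, weights, biases))
print(evades_all(detected, weights, biases))
print(append_random_suffix("poisoned input", ["sfx_a", "sfx_b"], random.Random(0)))
```

Requiring every probe to fall below the threshold mirrors the success criterion of evading all detectors simultaneously; a looser criterion, such as evading a majority of probes, would be a different and weaker attack goal.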