Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems

Vision-Language Agentic Systems (VLAS) connect visual perception to planning, tool use, and physical actions. Backdoor triggers can therefore propagate through both the decision pipeline and its connected interfaces, which makes visual backdoors a system-level threat. Current evaluations of such backdoors focus on clean accuracy and attack success rate (ASR), metrics that capture whether a trigger works but not whether the attack is "precise", i.e., whether it activates the hidden behavior only when intended. In this work, we formalize the failure of trigger precision as "trigger leakage": inputs that are visually or semantically close to the intended trigger inadvertently activate the attacker-specified behavior. To quantify this leakage, we introduce the Neighbor Leakage Rate (NLR). Our experiments show that, at a 3% poisoning ratio, icon and text triggers remain robust to common visual transformations, but their neighboring variants leak heavily, with NLR reaching 0.996 (icon) and 0.944 (text). Using textual triggers as a controlled probe, we show that standard fine-tuning learns a broad activation region rather than an exact trigger condition, so neighboring strings invoke the malicious behavior even when the exact trigger is absent. Adding edit-distance-one hard-negative samples during training substantially narrows this activation region and reduces leakage, including in image-editing and embodied-manipulation workflows, where leaked triggers can propagate into executable programs and action sequences.
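To make the text-trigger case concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not give the exact NLR formula, so the sketch assumes NLR is the fraction of neighbor inputs that activate the attacker-specified behavior, and it assumes "neighbors" of a text trigger are the strings at Levenshtein distance one. All function names, the alphabet, and the "clean" label are hypothetical and chosen only for illustration.

```python
import random
import string
from typing import Callable, Iterable, List, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def edit_distance_one_neighbors(trigger: str,
                                alphabet: str = string.ascii_lowercase) -> Set[str]:
    """Non-empty strings at edit distance exactly 1 from `trigger` (assumed neighbor set)."""
    neighbors: Set[str] = set()
    for i in range(len(trigger)):
        neighbors.add(trigger[:i] + trigger[i + 1:])              # deletion
        for c in alphabet:
            if c != trigger[i]:
                neighbors.add(trigger[:i] + c + trigger[i + 1:])  # substitution
    for i in range(len(trigger) + 1):
        for c in alphabet:
            neighbors.add(trigger[:i] + c + trigger[i:])          # insertion
    neighbors.discard("")  # an empty string is not a meaningful probe
    return neighbors


def neighbor_leakage_rate(neighbors: Iterable[str],
                          activates: Callable[[str], bool]) -> float:
    """Assumed NLR: share of neighbor inputs that fire the backdoor behavior."""
    items = list(neighbors)
    if not items:
        raise ValueError("need at least one neighbor to compute a rate")
    return sum(bool(activates(x)) for x in items) / len(items)


def make_hard_negatives(trigger: str, k: int, seed: int = 0,
                        alphabet: str = string.ascii_lowercase) -> List[Tuple[str, str]]:
    """Sample up to k neighbors and pair each with a (hypothetical) clean label."""
    pool = sorted(edit_distance_one_neighbors(trigger, alphabet))
    chosen = random.Random(seed).sample(pool, min(k, len(pool)))
    return [(text, "clean") for text in chosen]


if __name__ == "__main__":
    trigger = "trigger"
    nbrs = edit_distance_one_neighbors(trigger)
    # Stand-ins for a model: one fires on a broad region, one only on the exact string.
    broad = lambda s: levenshtein(s, trigger) <= 1
    precise = lambda s: s == trigger
    print("NLR (broad activation region):", neighbor_leakage_rate(nbrs, broad))
    print("NLR (exact trigger condition):", neighbor_leakage_rate(nbrs, precise))
    print("hard negatives:", make_hard_negatives(trigger, k=3))
```

In practice the `activates` callable would wrap a query to the fine-tuned model and a check for the attacker-specified behavior; the two lambdas above only illustrate the contrast between a broad activation region and an exact trigger condition.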
