Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling them to perceive, reason, and plan task-oriented actions directly from visual inputs. However, such vision-driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into VLM-based embodied agents using objects in the environment as triggers. Unlike textual triggers, object triggers vary widely across viewpoints and lighting conditions, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundary to ensure precise backdoor activation. Across various embodied agent benchmarks and VLMs, BEAT achieves attack success rates of up to 80% while maintaining strong benign task performance, and it generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy by up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in VLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
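The abstract does not give CTL's exact objective, but "preference learning between trigger-present and trigger-free inputs" suggests a pairwise DPO-style loss. Below is a minimal, hedged sketch of such a loss under that assumption: `logp_*` denote the policy's log-probability of the target (backdoor) response given a trigger-present vs. trigger-free observation, and `ref_logp_*` are the same quantities under a frozen reference model. All names and the `beta` temperature are illustrative, not from the paper.

```python
import math

def contrastive_trigger_loss(logp_pos, logp_neg,
                             ref_logp_pos, ref_logp_neg,
                             beta=0.1):
    """DPO-style pairwise preference loss (illustrative sketch, not BEAT's
    published objective).

    logp_pos / logp_neg: policy log-prob of the backdoor response given a
        trigger-present / trigger-free input.
    ref_logp_pos / ref_logp_neg: the same log-probs under a frozen
        reference model (e.g. the SFT checkpoint from stage one).

    Minimizing this pushes the policy to prefer the backdoor response only
    when the trigger is present, sharpening the activation boundary.
    """
    # Implicit reward margin between trigger-present and trigger-free inputs.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # Negative log-sigmoid of the margin (Bradley-Terry preference likelihood).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2; it decreases as the policy assigns relatively more probability to the backdoor response on trigger-present inputs than on trigger-free ones, which is the discrimination behavior the abstract describes.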