Vision-Language-Action (VLA) models close the perception-action loop by translating multimodal instructions into executable behaviors, but this very capability magnifies safety risks: jailbreaks that merely yield toxic text in LLMs can trigger unsafe physical actions in embodied systems. Existing defenses (alignment, filtering, or prompt hardening) intervene too late or in the wrong modality, leaving fused representations exploitable. We introduce a concept-based dictionary learning framework for inference-time safety control. By constructing sparse, interpretable dictionaries from hidden activations, our method identifies harmful concept directions and applies threshold-based interventions to suppress or block unsafe activations. Experiments on Libero-Harm, BadRobot, RoboPair, and IS-Bench show that our approach achieves state-of-the-art defense performance, cutting attack success rates by over 70\% while maintaining task success. Crucially, the framework is plug-and-play and model-agnostic, requiring no retraining and integrating seamlessly with diverse VLAs. To our knowledge, this is the first inference-time concept-based safety method for embodied systems, advancing both the interpretability and the safe deployment of VLA models.
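The abstract describes sparse-coding hidden activations against a learned concept dictionary and intervening when harmful concept coefficients exceed a threshold. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the ISTA solver, the function names (`sparse_code`, `safety_intervention`), the harmful-atom indices, and all thresholds are assumptions made for exposition.

\begin{verbatim}
import numpy as np

def sparse_code(h, D, n_iter=50, lam=0.1):
    """ISTA-style sparse coding: argmin_a 0.5*||h - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - h)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def safety_intervention(h, D, harmful_idx, tau):
    """Suppress harmful concept components of activation h.

    D's columns are concept directions learned from hidden activations;
    harmful_idx marks atoms flagged as unsafe. Returns (edited_h, blocked).
    Thresholds here are illustrative only.
    """
    a = sparse_code(h, D)
    harmful_mass = np.abs(a[harmful_idx]).sum()
    if harmful_mass <= tau:
        return h, False                        # below threshold: pass through
    h_edit = h - D[:, harmful_idx] @ a[harmful_idx]  # remove harmful directions
    blocked = harmful_mass > 5 * tau           # far above threshold: refuse action
    return h_edit, blocked

# Toy usage with random data; dimensions are placeholders.
rng = np.random.default_rng(0)
D = rng.normal(size=(256, 64)); D /= np.linalg.norm(D, axis=0)
h = rng.normal(size=256)
h_safe, blocked = safety_intervention(h, D, harmful_idx=[3, 17], tau=0.5)
print("blocked:", blocked)
\end{verbatim}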