The safety and reliability of vision-language models (VLMs) are a crucial part of deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks that undermine their safety alignment to elicit harmful outputs. In this work, we extend the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and evaluate its performance against the JailBreakV-28K benchmark of multi-modal jailbreaking attacks. We find that RESTA effectively reduces the attack success rate across this diverse corpus of attacks, particularly when employing directional embedding noise, in which the injected noise is aligned with the original token embedding vectors. Our results demonstrate that RESTA can contribute to securing VLMs within agentic systems, serving as a lightweight, inference-time defense layer within an overall security framework.
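The smoothing-and-aggregation idea can be sketched as follows: perturb the input token embeddings with noise aligned along each embedding's own direction, generate an output for each noisy sample, and aggregate the sampled outputs. This is a minimal NumPy illustration only; the function names, the majority-vote aggregation, and the noise scale `sigma` are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def directional_noise(emb, sigma, rng):
    """Add noise aligned with each token's embedding direction.

    emb: (num_tokens, dim) array of token embeddings.
    A random scalar scale per token (sigma is an assumed hyperparameter)
    is applied along the unit vector of that token's embedding.
    """
    norms = np.linalg.norm(emb, axis=-1, keepdims=True)
    directions = emb / np.maximum(norms, 1e-8)  # unit vectors per token
    scales = rng.normal(0.0, sigma, size=(emb.shape[0], 1))
    return emb + scales * directions

def smoothed_generate(model, emb, n_samples=8, sigma=0.1, seed=0):
    """Randomized smoothing sketch: sample noisy embeddings, run the
    model on each, and aggregate the outputs by majority vote."""
    rng = np.random.default_rng(seed)
    outputs = [model(directional_noise(emb, sigma, rng))
               for _ in range(n_samples)]
    values, counts = np.unique(outputs, return_counts=True)
    return values[np.argmax(counts)]
```

In practice `model` would be the VLM's generation step operating on perturbed embeddings rather than a toy callable, and aggregation would be applied at the token level; the sketch only conveys the smoothing/aggregation structure.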