Large Language Models (LLMs) are being enhanced with the ability to use tools and to process multiple modalities. These capabilities bring new benefits but also new security risks. In this work, we show that an attacker can use visual adversarial examples to cause attacker-desired tool usage. For example, the attacker could cause a victim LLM to delete calendar events, leak private conversations, and book hotels. Unlike prior work, our attacks can affect the confidentiality and integrity of user resources connected to the LLM while remaining stealthy and generalizing across multiple input prompts. We construct these attacks using gradient-based adversarial training and characterize their performance along multiple dimensions. We find that our adversarial images can manipulate the LLM into invoking tools with real-world syntax almost always (~98%) while maintaining high similarity to clean images (~0.9 SSIM). Furthermore, using human scoring and automated metrics, we find that the attacks do not noticeably affect the conversation (or its semantics) between the user and the LLM.
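To make the image-similarity figure above concrete, the following is a minimal sketch of the structural similarity index (SSIM) in its simplest, single-window form. It is an illustration under stated assumptions, not the evaluation code used in this work: reported SSIM values are normally computed with a sliding local window, and the function name `global_ssim`, the flattened-pixel input format, and the toy perturbation are all hypothetical. The stabilizing constants use the conventional choices K1 = 0.01 and K2 = 0.03.

```python
from statistics import fmean


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equal-length sequences of pixel intensities.

    Assumption: images are passed as flattened sequences of floats in
    [0, data_range]. Standard SSIM averages this quantity over local windows;
    this simplified variant uses one window covering the whole image.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Conventional stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    var_y = fmean((b - mu_y) ** 2 for b in y)
    cov_xy = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical toy data: a gray ramp and a slightly perturbed copy of it.
clean = [i / 255 for i in range(256)]
perturbed = [
    min(1.0, max(0.0, p + (0.02 if i % 2 else -0.02)))
    for i, p in enumerate(clean)
]

print(global_ssim(clean, clean))      # identical images score (approximately) 1
print(global_ssim(clean, perturbed))  # a small perturbation scores below 1
```

A score near 1 means the two images share nearly the same luminance, contrast, and structure, which is the sense in which an adversarial image with ~0.9 SSIM stays visually close to its clean counterpart.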