Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across a variety of multi-modal tasks. However, deploying VLMs requires substantial energy and computational resources. If attackers maliciously induce high energy consumption and latency (the energy-latency cost) during VLM inference, they can exhaust those computational resources. In this paper, we explore this attack surface on the availability of VLMs and aim to induce a high energy-latency cost during inference. We find that the energy-latency cost of VLM inference can be manipulated by maximizing the length of the generated sequences. To this end, we propose verbose images, which carry an imperceptible perturbation crafted to induce VLMs to generate long sequences during inference. Concretely, we design three loss objectives. First, we propose a loss that delays the occurrence of the end-of-sequence (EOS) token, which is the signal for a VLM to stop generating further tokens. Second, we propose an uncertainty loss and a token diversity loss, which respectively increase the uncertainty over each generated token and the diversity among all tokens of the generated sequence, breaking output dependency at the token level and the sequence level. Finally, we propose a temporal weight adjustment algorithm that effectively balances these losses. Extensive experiments demonstrate that, compared to original images, our verbose images increase the length of generated sequences by 7.87 times on MS-COCO and 8.56 times on ImageNet, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.
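To make the three loss objectives concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not give the exact formulas, so every definition here is an assumption: the EOS loss is taken as the mean EOS probability over decoding steps, the uncertainty loss as the negative mean entropy of the per-step token distributions, and the diversity loss as the mean pairwise cosine similarity between per-token vectors (a stand-in for whatever diversity measure the paper uses). The function names and the fixed `weights` argument are hypothetical; the temporal weight adjustment algorithm is not reproduced.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def eos_loss(step_logits, eos_id):
    """Assumed form: mean probability of EOS across decoding steps.
    Minimizing it pushes the EOS token later in the sequence."""
    probs = [softmax(logits)[eos_id] for logits in step_logits]
    return sum(probs) / len(probs)


def uncertainty_loss(step_logits):
    """Assumed form: negative mean entropy of the per-step distributions.
    Minimizing it raises the uncertainty over each generated token."""
    total_entropy = 0.0
    for logits in step_logits:
        p = softmax(logits)
        total_entropy += -sum(q * math.log(q) for q in p if q > 0.0)
    return -total_entropy / len(step_logits)


def diversity_loss(token_vectors):
    """Assumed stand-in: mean pairwise cosine similarity between the
    vectors of all generated tokens. Minimizing it spreads tokens apart."""
    n = len(token_vectors)
    if n < 2:
        return 0.0

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b + 1e-12)  # epsilon guards zero vectors

    sims = [cosine(token_vectors[i], token_vectors[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)


def verbose_objective(step_logits, token_vectors, eos_id,
                      weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses. The fixed weights are placeholders
    for the paper's temporal weight adjustment, which is not modeled here."""
    w_eos, w_unc, w_div = weights
    return (w_eos * eos_loss(step_logits, eos_id)
            + w_unc * uncertainty_loss(step_logits)
            + w_div * diversity_loss(token_vectors))
```

In an actual attack, a combined objective of this kind would be minimized with respect to a bounded image perturbation by backpropagating through the VLM; that step needs an autodiff framework and a real model, and is omitted from this sketch.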