Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, deploying VLMs requires substantial energy and computational resources. If attackers can maliciously induce high energy consumption and latency (energy-latency cost) during inference, they can exhaust computational resources. In this paper, we explore this attack surface on the availability of VLMs and aim to induce high energy-latency cost during VLM inference. We find that the energy-latency cost of VLM inference can be manipulated by maximizing the length of the generated sequence. To this end, we propose verbose images: imperceptible perturbations crafted to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of the end-of-sequence (EOS) token, the signal that tells a VLM to stop generating further tokens. Second, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty of each generated token and the diversity among all tokens of the generated sequence, respectively, which break output dependencies at the token level and the sequence level. Furthermore, a temporal weight adjustment algorithm is proposed to effectively balance these losses during optimization. Extensive experiments demonstrate that our verbose images increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on the MS-COCO and ImageNet datasets, respectively, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.
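The three loss objectives above can be illustrated with a minimal NumPy sketch. This is a hedged illustration based only on the abstract's description, not the paper's exact formulation: the delayed EOS loss is assumed to penalize EOS probability at every decoding step, the uncertainty loss to be a negative per-token entropy, and the token diversity loss to be approximated here by a negative nuclear norm of the hidden-state matrix (one common proxy for encouraging higher-rank, more diverse representations); the function names and signatures are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def delayed_eos_loss(logits, eos_id):
    # Sum of EOS-token probability over all T decoding steps; minimizing
    # this pushes the stop signal later, lengthening the output.
    probs = softmax(logits)              # shape (T, V)
    return probs[:, eos_id].sum()

def uncertainty_loss(logits):
    # Negative mean entropy: minimizing it maximizes uncertainty over
    # each generated token (token-level dependency breaking).
    probs = softmax(logits)
    return (probs * np.log(probs + 1e-12)).sum(axis=-1).mean()

def token_diversity_loss(hidden):
    # Negative nuclear norm (sum of singular values) of the (T, d)
    # hidden-state matrix; minimizing it encourages diverse token
    # representations across the whole sequence (sequence level).
    return -np.linalg.svd(hidden, compute_uv=False).sum()
```

The three terms would then be combined into one objective whose per-loss weights are rebalanced over optimization iterations by the temporal weight adjustment algorithm; the perturbation itself would be updated by gradient descent under an imperceptibility constraint (e.g. an L-infinity budget).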