Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. If malicious users can induce high energy consumption and latency (energy-latency cost), they can exhaust computational resources and harm service availability. In this paper, we investigate this vulnerability in MLLMs, particularly image-based and video-based ones, and aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation. We find that energy-latency cost can be driven up by maximizing the length of generated sequences, which motivates us to propose verbose samples, comprising verbose images and verbose videos. Concretely, we propose two modality-agnostic losses: a loss that delays the end-of-sequence (EOS) token and an uncertainty loss that increases the uncertainty over each generated token. In addition, greater diversity increases output complexity and thereby encourages longer responses, which motivates the following modality-specific losses. For verbose images, a token diversity loss promotes diverse hidden states. For verbose videos, a frame feature diversity loss increases the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples substantially extend the length of generated sequences.
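To make the loss terms concrete, the following is a minimal sketch of how such objectives could be scored on toy inputs. The abstract does not give the exact formulations, so everything here is an assumption for illustration: the function names are hypothetical, the EOS term is taken as the mean EOS probability, the uncertainty term as negative mean Shannon entropy, and the diversity term as mean pairwise cosine similarity (a stand-in proxy, not necessarily the measure used in the paper). Fixed weights replace the temporal weight adjustment algorithm, whose details are not stated here. In an actual attack these quantities would be differentiable and minimized with respect to a bounded input perturbation; the sketch only evaluates the scalar objectives.

```python
import math


def eos_delay_loss(step_probs, eos_id):
    """Mean probability assigned to EOS across generation steps (assumed form).

    Minimizing this pushes the model away from ending the sequence.
    """
    return sum(p[eos_id] for p in step_probs) / len(step_probs)


def uncertainty_loss(step_probs):
    """Negative mean Shannon entropy of the per-step distributions (assumed form).

    Minimizing this raises the uncertainty over each generated token.
    """
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0.0)

    return -sum(entropy(p) for p in step_probs) / len(step_probs)


def diversity_loss(vectors):
    """Mean pairwise cosine similarity (illustrative proxy, needs >= 2 vectors).

    The vectors may be token hidden states (images) or frame features (videos).
    Lower values mean the vectors point in more diverse directions.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def total_loss(step_probs, vectors, eos_id, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; fixed weights are a placeholder."""
    w_eos, w_unc, w_div = weights
    return (w_eos * eos_delay_loss(step_probs, eos_id)
            + w_unc * uncertainty_loss(step_probs)
            + w_div * diversity_loss(vectors))


if __name__ == "__main__":
    # Toy vocabulary of three tokens; index 2 plays the role of EOS (hypothetical).
    step_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
    # Two toy feature vectors standing in for hidden states or frame features.
    vectors = [[1.0, 0.0], [0.0, 1.0]]
    print(total_loss(step_probs, vectors, eos_id=2))
```

All three terms are written so that a lower value corresponds to the behavior the attack aims for: EOS is less likely, per-token distributions are flatter, and features are less aligned with one another.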