Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing the attention patterns of VLMs processing a series of images, we observe that inter-image attention is absent in a substantial fraction of layers. Based on this, we propose BlindSight: an approach to optimizing multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization of attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to exploit this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention computation for prompt lengths of 36K-300K tokens. BlindSight generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3), with only a 0.78% absolute accuracy degradation on average on multi-image comprehension benchmarks. Finally, we advocate for the design of efficient VLMs that combine BlindSight-inspired sparse layers with dense layers.
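To make the head categorization concrete, the following is a minimal sketch of how a template-aware sparsity mask could be built for each head category. The function name, the `image_spans` representation, and the rule that text tokens keep full causal attention are illustrative assumptions, not the paper's exact implementation (which runs as a fused Triton kernel rather than a materialized mask):

```python
import numpy as np

def build_sparsity_mask(seq_len, image_spans, sink_len=4, category="intra_image_sink"):
    """Boolean attention mask (True = attend) for one head category.

    seq_len:     total prompt length in tokens.
    image_spans: list of (start, end) token ranges occupied by each image.
    sink_len:    number of initial "sink" tokens every query may attend to.
    category:    "dense", "sink", "intra_image", or "intra_image_sink".

    Illustrative sketch only; names and the text-token rule are assumptions.
    """
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if category == "dense":
        return causal  # dense heads keep full causal attention

    mask = np.zeros((seq_len, seq_len), dtype=bool)
    if category in ("sink", "intra_image_sink"):
        mask[:, :sink_len] = True  # every query attends to the sink tokens
    if category in ("intra_image", "intra_image_sink"):
        for start, end in image_spans:
            mask[start:end, start:end] = True  # attention stays within each image

    # Assumed here: tokens outside any image (text) retain full causal attention.
    in_image = np.zeros(seq_len, dtype=bool)
    for start, end in image_spans:
        in_image[start:end] = True
    mask[~in_image, :] = True

    return mask & causal
```

Because the mask depends only on the input template (where the image token ranges fall), it can be computed once per prompt layout, which is what allows the sparsity to be applied with no runtime overhead.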