Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively attend over all of these tokens, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded both as high-resolution tokens and as low-resolution tokens, and only the low-resolution tokens plus a small number of selected high-resolution tokens are used to compute the attention map, which greatly reduces the computational cost. The high-resolution tokens are selected by a high-resolution selection module that retrieves tokens from relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated with the low-resolution tokens and text tokens and fed into a hierarchical self-attention layer, which produces an attention map used for the next step's high-resolution token selection. The hierarchical self-attention and high-resolution token selection are performed iteratively at each attention layer. Experiments on multimodal benchmarks show that FlexAttention outperforms existing high-resolution VLMs (e.g., a relative gain of ~9% on V* Bench and ~7% on TextVQA) while significantly reducing the computational cost by nearly 40%.
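The per-layer loop described above (select a few high-resolution tokens from an attention map, attend over them together with the low-resolution and text tokens, then emit a map for the next selection) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the top-k selection, the use of pooled text-to-image attention as the relevance score, and the fixed low-to-high-resolution region mapping are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_high_res(attn_map, high_res, k):
    # attn_map: (N_high,) relevance score per high-resolution token.
    # Keep only the k most attended tokens (assumed selection rule).
    idx = np.argsort(attn_map)[-k:]
    return high_res[idx]

def hierarchical_attention_layer(low_res, text, high_res, attn_map, k):
    # Attend over low-res + k selected high-res + text tokens only,
    # instead of all N_high high-resolution tokens.
    selected = select_high_res(attn_map, high_res, k)
    tokens = np.concatenate([low_res, selected, text], axis=0)
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)  # single-head, no projections
    attn = softmax(scores, axis=-1)
    out = attn @ tokens
    # Relevance map for the NEXT layer's selection: attention mass the
    # text tokens place on the low-res tokens, broadcast to the high-res
    # grid (assumption: each low-res token summarizes a contiguous block
    # of N_high / N_low high-res tokens).
    n_low, n_text = low_res.shape[0], text.shape[0]
    text_to_low = attn[-n_text:, :n_low].mean(axis=0)
    next_map = np.repeat(text_to_low, high_res.shape[0] // n_low)
    return out, next_map
```

Stacking this layer and feeding each `next_map` into the following layer's selection reproduces the iterative select-then-attend pattern; the attention cost scales with `N_low + k + N_text` rather than `N_high`.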