Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck

In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is (a) suitable for generative tasks, (b) suitable for discriminative tasks, (c) nearly lossless, and (d) storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot is a "double-forward pass" training strategy: during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens; then, during the second forward pass, the same LLM processes the language instruction(s) alongside the summary tokens, which directly replace the image tokens. The training signal is provided by two losses: an autoregressive loss, applied after the second pass, that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot yields highly informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2x higher compression rate without compromising the generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state of the art on image retrieval and compositionality.
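To make the data flow of the double-forward pass concrete, the following is a minimal toy sketch in plain Python, not the actual implementation. Everything in it is an illustrative assumption: `toy_llm` is a hypothetical stand-in for the shared LLM (a causal prefix average), `adapter_scale` merely mimics a stage-specific adapter, the summary tokens are modeled as extra query tokens appended after the image tokens, hidden vectors are read directly as logits over a toy vocabulary, and the contrastive loss is assumed to be an InfoNCE loss between pooled summary tokens and caption embeddings.

```python
import math
import random

def toy_llm(tokens, adapter_scale):
    """Hypothetical stand-in for the shared LLM: each output is the causal
    mean of the prefix, scaled by a value mimicking a stage-specific adapter."""
    out, running = [], [0.0] * len(tokens[0])
    for i, tok in enumerate(tokens):
        running = [r + t for r, t in zip(running, tok)]
        out.append([adapter_scale * r / (i + 1) for r in running])
    return out

def first_pass(image_tokens, summary_queries):
    """Pass 1: process image tokens followed by summary queries and keep
    only the summary positions (the bottleneck)."""
    hidden = toy_llm(image_tokens + summary_queries, adapter_scale=1.0)
    return hidden[len(image_tokens):]

def second_pass(summary_tokens, instruction_tokens):
    """Pass 2: the same toy LLM sees the summary tokens in place of the
    image tokens, followed by the instruction tokens."""
    hidden = toy_llm(summary_tokens + instruction_tokens, adapter_scale=0.5)
    return hidden[len(summary_tokens):]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def mean_pool(tokens):
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE over a batch; matching pairs share an index."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, t) / temperature for t in txts]
        total -= log_softmax(logits)[i]
    return total / len(imgs)

def autoregressive_loss(hidden, target_ids):
    """Token-level cross-entropy; hidden vectors double as toy logits."""
    nll = [-log_softmax(h)[t] for h, t in zip(hidden, target_ids)]
    return sum(nll) / len(nll)

random.seed(0)
DIM, N_IMAGE, N_SUMMARY, N_TEXT, BATCH = 8, 16, 4, 5, 3

def rand_tokens(n):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]

summary_queries = rand_tokens(N_SUMMARY)   # shared, learnable in a real model
batch_images = [rand_tokens(N_IMAGE) for _ in range(BATCH)]
caption_embs = [rand_tokens(1)[0] for _ in range(BATCH)]  # toy text embeddings
instruction = rand_tokens(N_TEXT)
target_ids = [random.randrange(DIM) for _ in range(N_TEXT)]

# First pass: compress every image into N_SUMMARY summary tokens.
summaries = [first_pass(img, summary_queries) for img in batch_images]
loss_con = contrastive_loss([mean_pool(s) for s in summaries], caption_embs)

# Second pass: answer the instruction from the summary tokens alone.
answer_hidden = second_pass(summaries[0], instruction)
loss_ar = autoregressive_loss(answer_hidden, target_ids)

total_loss = loss_ar + loss_con  # equal weighting is an assumption
```

In this toy setting, 16 image tokens are replaced by 4 summary tokens before the instruction is processed, so only the summary tokens would need to be stored per image. The token counts, the loss weighting, and the pooling used for the contrastive embedding are placeholders chosen for illustration.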
