Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip achieves at least a 5% performance gain over the previous state-of-the-art method across nearly all settings. Moreover, our method significantly accelerates inference, reducing prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
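The core idea of selecting a small set of informative visual tokens can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the paper's implementation: it assumes a per-token importance score (e.g., attention received from a [CLS] token in the vision encoder) is already available, and keeps only the top-scoring fraction of tokens before they are passed to the language model.

```python
import numpy as np

def select_dominant_tokens(visual_tokens, attn_scores, keep_ratio=0.1):
    """Keep only the most informative visual tokens (illustrative sketch).

    visual_tokens: (N, D) array of vision-encoder outputs
    attn_scores:   (N,) per-token importance, e.g. [CLS]-attention weights
    keep_ratio:    fraction of tokens to retain
    """
    k = max(1, int(len(visual_tokens) * keep_ratio))
    top_idx = np.argsort(attn_scores)[-k:]        # indices of k highest scores
    return visual_tokens[np.sort(top_idx)]        # preserve original token order

# Toy example: 576 patch tokens (a common CLIP output length), keep 10%
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 1024))
scores = rng.random(576)
kept = select_dominant_tokens(tokens, scores, keep_ratio=0.1)
print(kept.shape)  # (57, 1024)
```

In this sketch, feeding `kept` instead of `tokens` to the language model shrinks the visual sequence by roughly 10x, which is the source of the prefilling speedup the abstract describes; the actual method additionally merges the remaining tokens rather than discarding them outright.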