Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip achieves at least a 5% performance gain over the previous state-of-the-art method across nearly all settings. Moreover, our method significantly accelerates inference, reducing prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
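The core idea of selecting a small set of informative visual tokens can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the paper's implementation: it assumes a per-token importance score (e.g., attention received from a [CLS] token in the vision encoder) is already available, and keeps only the top-scoring fraction of tokens before they are passed to the language model.

```python
import numpy as np

def select_dominant_tokens(visual_tokens, attn_scores, keep_ratio=0.1):
    """Keep only the most informative visual tokens (illustrative sketch).

    visual_tokens: (N, D) array of vision-encoder outputs
    attn_scores:   (N,) per-token importance, e.g. [CLS]-attention weights
    keep_ratio:    fraction of tokens to retain
    """
    k = max(1, int(len(visual_tokens) * keep_ratio))
    top_idx = np.argsort(attn_scores)[-k:]        # indices of k highest scores
    return visual_tokens[np.sort(top_idx)]        # preserve original token order

# Toy example: 576 patch tokens (a common CLIP output length), keep 10%
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 1024))
scores = rng.random(576)
kept = select_dominant_tokens(tokens, scores, keep_ratio=0.1)
print(kept.shape)  # (57, 1024)
```

In this sketch, feeding `kept` instead of `tokens` to the language model shrinks the visual sequence by roughly 10x, which is the source of the prefilling speedup the abstract describes; the actual method additionally merges the remaining tokens rather than discarding them outright.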