Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is used to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with distinct token compression strategies, which limits their ability to combine images and videos. To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), in which the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. Video tokens are efficiently compressed by exploiting the inherent temporal redundancy; images, repeated as static videos, have their spatial details gradually supplemented across multiple frames. PVC thus unifies token compression for images and videos. Even with a limited number of tokens per frame (64 by default), spatial details and temporal changes are preserved. Experiments show that our model achieves state-of-the-art performance across various video understanding benchmarks, including long-video tasks and fine-grained short-video tasks. Meanwhile, our unified token compression strategy incurs no performance loss on image benchmarks, particularly on detail-sensitive tasks.
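The core idea can be sketched in a toy form: repeat an image's tokens into a "static" video, then have each frame contribute only the fixed per-frame budget (64 tokens by default) of information not yet captured by previous frames. The shapes, function names, and the residual-based selection rule below are illustrative assumptions, not the actual PVC mechanism, which uses learned adaptive compression inside the model.

```python
import numpy as np

TOKENS_PER_FRAME = 64  # per-frame token budget, as in the paper's default

def image_to_static_video(image_tokens: np.ndarray, num_frames: int) -> np.ndarray:
    """Repeat one image's token grid into a 'static' video of num_frames frames.

    image_tokens: (num_tokens, dim) array of visual tokens (assumed shape).
    Returns: (num_frames, num_tokens, dim).
    """
    return np.stack([image_tokens] * num_frames, axis=0)

def progressive_compress(frames: np.ndarray) -> np.ndarray:
    """Toy progressive compression (illustrative stand-in for PVC).

    Each frame keeps the TOKENS_PER_FRAME tokens that differ most from a
    running reconstruction built from previous frames, mimicking the idea of
    'supplementing information not extracted from previous frames'.
    Returns: (num_frames, TOKENS_PER_FRAME, dim).
    """
    num_frames, num_tokens, dim = frames.shape
    reconstruction = np.zeros((num_tokens, dim))
    out = []
    for t in range(num_frames):
        # Per-token residual vs. what earlier frames already captured.
        residual = np.linalg.norm(frames[t] - reconstruction, axis=-1)
        # Keep the tokens with the largest residual (most new information).
        keep = np.argsort(residual)[-TOKENS_PER_FRAME:]
        reconstruction[keep] = frames[t][keep]
        out.append(frames[t][keep])
    return np.stack(out, axis=0)
```

For a static video (a repeated image), each frame's residual is zero at already-covered positions, so successive frames naturally select fresh spatial detail; for a real video, temporal redundancy likewise keeps per-frame residuals, and hence the retained tokens, small.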