Large Language Models (LLMs) have been widely applied across diverse tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms arbitrary well-trained image-based LLMs into video-LLMs (after training on video data). To better adapt image-LLMs for video processing, we introduce two design principles: a linear transformation that preserves the original visual-language alignment, and condensation of representative information from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo, and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
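The two design principles above can be sketched as a minimal, hypothetical pipeline: a single linear projection over per-frame image tokens (so the image-LLM's learned visual-language alignment is not distorted by a nonlinearity), followed by selecting a small set of representative tokens from the redundant frame sequence. The shapes, the norm-based saliency heuristic, and the function name below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def linvt_sketch(frame_tokens: np.ndarray, W: np.ndarray, k: int) -> np.ndarray:
    """Condense per-frame image tokens into k video tokens.

    frame_tokens: (T, N, D) tokens from an image encoder, T frames of N tokens.
    W: (D, D) projection matrix — the only transform is linear (principle 1).
    k: number of condensed video tokens to keep (principle 2).
    """
    T, N, D = frame_tokens.shape
    flat = frame_tokens.reshape(T * N, D) @ W  # linear transformation only
    # Stand-in for a learned saliency score: rank tokens by their norm
    # and keep the top-k as the "representative" condensed set.
    scores = np.linalg.norm(flat, axis=1)
    keep = np.sort(np.argsort(scores)[-k:])    # preserve temporal order
    return flat[keep]                          # (k, D) tokens fed to the LLM

# Example: 4 frames of 16 tokens each, condensed to 10 video tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16, 8))
W = rng.normal(size=(8, 8))
video_tokens = linvt_sketch(tokens, W, k=10)
```

In a real system the projection and the token-selection weights would be the trainable parameters learned on video data, while the image encoder and LLM stay frozen; the heuristic scoring here merely stands in for that learned condensation.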