Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details due to its sparse temporal coverage. Furthermore, processing full images and their tokens for every frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically, motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image-encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec-primitive densities, we maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
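The codec primitives the abstract refers to, motion vectors and residuals, encode each predicted frame as block-wise displacements from a reference frame plus a sparse correction. The following is a minimal NumPy sketch of this motion-compensated reconstruction; the block size, vector layout, and function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def reconstruct_frame(ref, motion_vectors, residuals, block=4):
    """Toy motion-compensated decoding: each block of the output frame is
    copied from a displaced block of the reference frame (motion vector),
    then corrected by a residual. Block size and the (dy, dx) vector grid
    layout are illustrative choices, not a real codec's format."""
    h, w = ref.shape
    out = np.zeros_like(ref)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            # One motion vector per block, indexed on a coarse grid.
            dy, dx = motion_vectors[by // block, bx // block]
            sy, sx = by + dy, bx + dx  # displaced source block in ref
            out[by:by + block, bx:bx + block] = (
                ref[sy:sy + block, sx:sx + block]
                + residuals[by:by + block, bx:bx + block]
            )
    return out
```

Because most blocks are well predicted by motion alone, the residuals are typically near-zero and sparse, which is the redundancy the proposed lightweight encoders exploit instead of re-encoding full frames.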