Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which often misses both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing a full image and its tokens for every frame incurs substantial computational overhead. We address these limitations by leveraging video codec primitives, specifically motion vectors and residuals, which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives, and we align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach, CoPE-VideoLM, reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal and motion reasoning, long-form understanding, and spatial scene understanding.
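To make the token-usage argument concrete, the following is a minimal sketch of the arithmetic behind mixing fully encoded keyframes with cheaply encoded codec-primitive frames. The function names and all numbers (frame count, keyframe interval, tokens per frame) are hypothetical values chosen for illustration; they are not taken from CoPE-VideoLM and do not reproduce its reported results.

```python
import math


def dense_token_count(num_frames: int, tokens_per_image: int) -> int:
    """Tokens used when every frame is encoded as a full image."""
    return num_frames * tokens_per_image


def mixed_token_count(
    num_frames: int,
    keyframe_interval: int,
    tokens_per_image: int,
    tokens_per_codec_frame: int,
) -> int:
    """Tokens used when only one frame per interval is a full keyframe.

    Every other frame is assumed to be represented by a small number of
    tokens derived from its motion vectors and residuals (hypothetical).
    """
    num_keyframes = math.ceil(num_frames / keyframe_interval)
    num_codec_frames = num_frames - num_keyframes
    return (
        num_keyframes * tokens_per_image
        + num_codec_frames * tokens_per_codec_frame
    )


def token_reduction(dense: int, mixed: int) -> float:
    """Fraction of tokens saved relative to the dense baseline."""
    return 1.0 - mixed / dense


if __name__ == "__main__":
    # Illustrative settings only: 64 frames, one keyframe every 8 frames,
    # 196 tokens per full image, 8 tokens per codec-primitive frame.
    dense = dense_token_count(64, 196)
    mixed = mixed_token_count(64, 8, 196, 8)
    print(dense, mixed, round(token_reduction(dense, mixed), 3))
```

Under these assumed settings, lengthening the keyframe interval or shrinking the per-frame codec token count lowers the total further, which is the density trade-off the abstract refers to.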