Effective video tokenization is critical for scaling transformer models to long videos. Current approaches tokenize videos using space-time patches, which produces excessive tokens and computational inefficiency. Even the best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks; for example, on video-text retrieval it outperforms ViT3D by a large margin of 6% in average top-5 recall while using 10x fewer tokens. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average performance improvement of 5.2% across 6 VideoQA benchmarks with 4x faster training and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
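The scaling argument in the abstract can be illustrated with simple token-count arithmetic. The sketch below is not taken from the paper: the tubelet size (2 frames by 16x16 pixels), the 224x224 resolution, the figure of 150 trajectories, and the one-token-per-trajectory mapping are all illustrative assumptions, and both function names are hypothetical. It only shows that a space-time patch tokenizer's token count grows linearly with the number of frames, whereas a trajectory-based count depends on how many trajectories the scene contains.

```python
def spacetime_patch_tokens(frames, height, width, t_patch=2, s_patch=16):
    """Token count for non-overlapping space-time patches (tubelets).

    The default patch sizes are assumed values for illustration only.
    """
    return (frames // t_patch) * (height // s_patch) * (width // s_patch)


def trajectory_tokens(num_trajectories, tokens_per_trajectory=1):
    """Token count when tokens are tied to trajectories, not patches.

    One token per trajectory is an assumption, not the paper's stated design.
    """
    return num_trajectories * tokens_per_trajectory


# Patch tokens scale with duration: 4x the frames gives 4x the tokens.
short_clip = spacetime_patch_tokens(frames=16, height=224, width=224)
long_clip = spacetime_patch_tokens(frames=64, height=224, width=224)
print(short_clip, long_clip)  # 1568 6272

# Hypothetical scene with 150 sub-object trajectories: the count is the
# same whether those trajectories span 16 frames or 64.
print(trajectory_tokens(150))  # 150
```

Under these assumed numbers, the patch-based count quadruples when the clip length quadruples, while the trajectory-based count changes only if the scene itself gains or loses trajectories.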