We propose an efficient framework that compresses massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion that arises from hour-long videos. Our design leverages a bidirectional state-space model equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both the spatial and temporal dimensions, preserving performance in a cost-effective manner. On challenging hour-long video understanding tasks, our approach achieves competitive results against state-of-the-art models while significantly reducing the overall token budget. Notably, replacing our state-space model with conventional modules leads to substantial performance degradation, highlighting the advantage of the proposed state-space modeling for effectively compressing multi-frame video information. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployment. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.
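The following is a minimal sketch of one compression stage of the kind the abstract describes, under stated assumptions: the paper does not specify the SSM parameterization, so a simple diagonal linear recurrence stands in for it (a Mamba-style selective scan would be a drop-in replacement), and all module names (`ToyBidirectionalSSM`, `QueryPoolingCompressor`), shapes, and the `stride` hyperparameter are hypothetical. The actual method applies such stages hierarchically over spatial and temporal dimensions; this sketch shows a single temporal stage.

```python
# Hypothetical sketch, not the paper's implementation: a bidirectional
# state-space block with a gated skip connection, followed by learnable
# weighted-average pooling of each token window into a learned query.
import torch
import torch.nn as nn


class ToyBidirectionalSSM(nn.Module):
    """Bidirectional diagonal linear recurrence h_t = a*h_{t-1} + (1-a)*x_t,
    run forward and backward, then fused through a gated skip connection.
    A stand-in for the paper's (unspecified) state-space block."""

    def __init__(self, dim: int):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(dim))   # per-channel decay logits
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(2 * dim, dim)
        self.gate = nn.Linear(dim, dim)               # gated skip connection

    def _scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); sequential scan for clarity, not speed
        a = torch.sigmoid(self.decay)                 # keep 0 < a < 1 for stability
        h = torch.zeros_like(x[:, 0])
        out = []
        for t in range(x.shape[1]):
            h = a * h + (1 - a) * x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.in_proj(x)
        fwd = self._scan(u)                           # forward direction
        bwd = self._scan(u.flip(1)).flip(1)           # backward direction
        y = self.out_proj(torch.cat([fwd, bwd], dim=-1))
        g = torch.sigmoid(self.gate(x))               # gate mixes SSM output and skip path
        return g * y + (1 - g) * x


class QueryPoolingCompressor(nn.Module):
    """Inserts a learned query every `stride` tokens and pools each window
    into one output token via a learnable (softmax-weighted) average."""

    def __init__(self, dim: int, stride: int):
        super().__init__()
        self.stride = stride
        self.query = nn.Parameter(torch.randn(dim) * 0.02)  # learned query
        self.score = nn.Linear(dim, 1)                # produces pooling weights
        self.ssm = ToyBidirectionalSSM(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_tokens, dim); num_tokens % stride == 0
        x = self.ssm(frame_tokens)
        b, n, d = x.shape
        windows = x.view(b, n // self.stride, self.stride, d)
        q = self.query.expand(b, n // self.stride, 1, d)
        w = torch.softmax(self.score(windows + q), dim=2)   # (b, n/s, s, 1)
        return (w * windows).sum(dim=2)               # (b, n/s, d): compressed tokens


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 32)                   # 2 clips, 64 tokens each, dim 32
    compressor = QueryPoolingCompressor(dim=32, stride=8)
    print(compressor(tokens).shape)                   # torch.Size([2, 8, 32]): 8x fewer tokens
```

Stacking several such stages, alternating over spatial and temporal token groupings, would yield the hierarchical downsampling the abstract refers to; the gated skip connection lets each stage fall back to the uncompressed features where the SSM output is uninformative.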