Optimizing video inference efficiency has become increasingly important as demand for video analysis grows across many fields. Some existing methods achieve high efficiency by explicitly discarding spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with a Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stages of the network to reduce computational cost while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the salient and non-salient frame features, refining and updating them into a video sequence of high- and low-resolution frames. To process this sequence, we introduce a Resolution-Align Transformer Layer that captures global temporal correlations among frame features of different resolutions, while reducing spatial computation cost quadratically by using fewer spatial tokens in the low-resolution non-salient frames. Because the compression module is differentiable, the entire network can be optimized end to end. Experimental results show that, compared to state-of-the-art methods, our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification. Code: https://github.com/dun-research/DRCA
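The quadratic saving from using fewer spatial tokens can be illustrated with a back-of-the-envelope count of pairwise attention interactions. The sketch below is not the released implementation: the frame sizes (224 and 112 pixels), the patch size (16), the clip length (8 frames), the number of salient frames (2), and the function names are all hypothetical values chosen for illustration, and the count covers only per-frame spatial self-attention.

```python
def spatial_tokens(side: int, patch: int) -> int:
    """Number of patch tokens for a square frame of `side` pixels (assumed ViT-style patching)."""
    per_axis = side // patch
    return per_axis * per_axis


def attention_pairs(tokens: int) -> int:
    """Self-attention compares every token with every token: cost grows as tokens squared."""
    return tokens * tokens


def clip_spatial_cost(num_frames: int, num_salient: int,
                      high_side: int, low_side: int, patch: int) -> int:
    """Total per-frame spatial attention pairs for a mixed high/low-resolution clip.

    Salient frames keep `high_side` resolution; the rest are compressed to `low_side`.
    All arguments are illustrative assumptions, not values from the paper.
    """
    high = attention_pairs(spatial_tokens(high_side, patch))
    low = attention_pairs(spatial_tokens(low_side, patch))
    return num_salient * high + (num_frames - num_salient) * low


if __name__ == "__main__":
    # Hypothetical setting: 8 frames, 2 kept at 224x224, 6 compressed to 112x112.
    full = clip_spatial_cost(8, 8, 224, 112, 16)   # every frame at high resolution
    mixed = clip_spatial_cost(8, 2, 224, 112, 16)  # high-low resolution sequence
    print(f"all high-res: {full} pairs, mixed: {mixed} pairs, ratio: {full / mixed:.2f}")
```

Halving the side length of a frame cuts its token count by a factor of 4 and its spatial attention pairs by a factor of 16, which is why compressing non-salient frames dominates the saving in this toy count.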