Temporal-attentive Covariance Pooling Networks for Video Recognition

In video recognition, a global representation that summarizes the whole content of a video snippet plays an important role in final performance. However, existing video architectures usually generate it with simple global average pooling (GAP), which has limited ability to capture the complex dynamics of videos. For image recognition, there is evidence that covariance pooling has stronger representation ability than GAP. Unfortunately, the plain covariance pooling used in image recognition is an orderless representation, which cannot model the spatio-temporal structure inherent in videos. Therefore, this paper proposes Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations. Specifically, TCP first uses a temporal attention module to adaptively calibrate spatio-temporal features for the succeeding covariance pooling, approximately producing attentive covariance representations. Then, a temporal covariance pooling performs temporal pooling of the attentive covariance representations to characterize both intra-frame correlations and inter-frame cross-correlations of the calibrated features. As such, the proposed TCP can capture complex temporal dynamics. Finally, a fast matrix power normalization is introduced to exploit the geometry of covariance representations. Note that TCP is model-agnostic and can be flexibly integrated into any video architecture, resulting in TCPNet for effective video recognition. Extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show that TCPNet is clearly superior to its counterparts, while having strong generalization ability. The source code is publicly available.
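
To make the contrast between GAP and plain covariance pooling concrete, the following is a minimal NumPy sketch, not the TCP method itself. It assumes features shaped (frames, spatial positions, channels), uses hypothetical function names, and computes the matrix power exactly by eigendecomposition with an assumed exponent of 0.5; the abstract does not specify the exponent or the fast algorithm, so both are illustrative choices. The learned temporal attention module and the inter-frame cross-correlation terms are deliberately left out.

```python
import numpy as np


def global_average_pooling(feats):
    """GAP: average (T, N, C) features over frames and positions -> (C,)."""
    return feats.reshape(-1, feats.shape[-1]).mean(axis=0)


def plain_covariance_pooling(feats):
    """Plain covariance pooling: treat all T*N positions as an unordered
    set of C-dimensional vectors and return their (C, C) covariance."""
    x = feats.reshape(-1, feats.shape[-1])
    centered = x - x.mean(axis=0, keepdims=True)
    return centered.T @ centered / x.shape[0]


def matrix_power_normalize(cov, alpha=0.5, eps=1e-10):
    """Matrix power of a symmetric PSD matrix via eigendecomposition.
    alpha=0.5 (matrix square root) is an assumption for illustration;
    this is the exact route, not a fast approximation."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, eps, None)  # guard against tiny negatives
    return (eigvecs * eigvals**alpha) @ eigvecs.T


# Synthetic stand-in for backbone features: 8 frames, 7x7 positions, 16 channels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 49, 16))

cov = plain_covariance_pooling(feats)
root = matrix_power_normalize(cov)

# Shuffling the frame order leaves both pooled representations unchanged,
# which is the "orderless" limitation described in the abstract.
shuffled = feats[rng.permutation(feats.shape[0])]
print("GAP unchanged by shuffling:",
      np.allclose(global_average_pooling(feats), global_average_pooling(shuffled)))
print("Covariance unchanged by shuffling:",
      np.allclose(cov, plain_covariance_pooling(shuffled)))
print("Normalized matrix squares back to covariance:",
      np.allclose(root @ root, cov, atol=1e-8))
```

The covariance representation keeps second-order channel statistics that GAP discards, but as the shuffle check suggests, it carries no information about frame order on its own, which is the gap the temporal attention and temporal covariance pooling steps are meant to address.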
