Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs to video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance is their increased computational complexity, which requires large-scale annotated datasets to train them at scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs, but existing approaches follow hand-designed and hard-wired techniques. In this paper, we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module that controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner. GSF leverages grouped spatial gating to decompose the input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into efficient, high-performing spatio-temporal feature extractors with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks. Code and models will be made publicly available at https://github.com/swathikirans/GSF.
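The gate, shift, and fuse steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository for that); it rests on several assumptions that the abstract does not state: the channels are split into two groups, each group's spatial gate is a tanh over a hypothetical linear projection of its channels (`gate_weight`), the gated features are shifted by one frame (forward for one group, backward for the other), and the fusion weights are a sigmoid over pooled statistics scaled by a hypothetical parameter (`fuse_weight`). All function and parameter names are illustrative.

```python
import numpy as np


def temporal_shift(x, step):
    """Shift x of shape (T, C, H, W) along time by `step` frames, zero-filling the gap."""
    out = np.zeros_like(x)
    if step > 0:
        out[step:] = x[:-step]
    elif step < 0:
        out[:step] = x[-step:]
    else:
        out[:] = x
    return out


def gate_shift_fuse(x, gate_weight, fuse_weight):
    """Simplified gate-shift-fuse over a clip x of shape (T, C, H, W), C even.

    gate_weight, fuse_weight: arrays of shape (2, C // 2). Both are
    hypothetical stand-ins for learned parameters (an assumption).
    """
    half = x.shape[1] // 2
    outputs = []
    # Assumption: group 0 is shifted forward in time, group 1 backward.
    for g, step in enumerate((1, -1)):
        xg = x[:, g * half:(g + 1) * half]
        # Gate: one spatial map per frame for this group, values in (-1, 1).
        gate = np.tanh(np.einsum("tchw,c->thw", xg, gate_weight[g]))[:, None]
        gated = gate * xg          # part routed through time
        residual = xg - gated      # part that stays in its own frame
        shifted = temporal_shift(gated, step)
        # Fuse: per-channel weight in (0, 1) computed from pooled statistics.
        pooled = shifted.mean(axis=(0, 2, 3)) - residual.mean(axis=(0, 2, 3))
        w = 1.0 / (1.0 + np.exp(-fuse_weight[g] * pooled))
        w = w[None, :, None, None]
        outputs.append(w * shifted + (1.0 - w) * residual)
    return np.concatenate(outputs, axis=1)


rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 8, 5, 5))   # T=4 frames, C=8 channels
out = gate_shift_fuse(clip, rng.standard_normal((2, 4)), rng.standard_normal((2, 4)))
print(out.shape)
```

The output has the same shape as the input, which is what would let a module of this kind be dropped between layers of a 2D CNN. In this sketch the extra parameters are only the two small weight arrays, in the spirit of the negligible overhead the abstract describes.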