Recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often overfit on small-scale datasets with a limited number of training videos. A common solution is to apply existing image augmentation strategies, such as Mixup, CutMix, and RandAugment, to each frame individually, although these strategies are not designed for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models trained on limited videos. SV-Mix uses a learnable selective module to choose the most informative volumes from two videos and mixes these volumes to produce a new training video. Technically, we propose two new modules: a spatial selective module that selects local patches for each spatial position, and a temporal selective module that mixes entire frames for each timestamp and thus preserves the spatial pattern. For each augmentation, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of SV-Mix on a wide range of video action recognition benchmarks, where it consistently boosts the performance of both CNN-based and transformer-based models.
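To make the mixing step concrete, the following is a minimal NumPy sketch of volume-level mixing between two videos, not the implementation described in the paper. Several pieces are assumptions made for illustration: random binary masks stand in for the learnable selective modules, the spatial patch choice is shared across all frames, the two branches are chosen with equal probability, and the label is mixed in proportion to the volume taken from each video (a CutMix-style rule that the abstract does not specify). The names `spatial_mask`, `temporal_mask`, `sv_mix_sketch`, `patch`, and `keep_prob` are hypothetical.

```python
import numpy as np


def spatial_mask(rng, t, h, w, patch, keep_prob=0.5):
    """Per-patch binary mask, repeated over all frames.

    ASSUMPTION: a random draw replaces the learnable spatial selective module.
    """
    if h % patch or w % patch:
        raise ValueError("frame height and width must be divisible by patch")
    grid = rng.random((h // patch, w // patch)) < keep_prob
    # Upsample the patch grid to pixel resolution.
    mask2d = grid.repeat(patch, axis=0).repeat(patch, axis=1)
    return np.broadcast_to(mask2d, (t, h, w)).copy()


def temporal_mask(rng, t, h, w, keep_prob=0.5):
    """Per-frame binary mask; each frame comes entirely from one video.

    ASSUMPTION: a random draw replaces the learnable temporal selective module.
    """
    frames = rng.random(t) < keep_prob
    return np.broadcast_to(frames[:, None, None], (t, h, w)).copy()


def sv_mix_sketch(video_a, video_b, label_a, label_b, rng, patch=4):
    """Mix two videos of shape (T, H, W, C) with a spatial or temporal mask."""
    if video_a.shape != video_b.shape:
        raise ValueError("both videos must have the same shape")
    t, h, w, _ = video_a.shape
    # Randomly pick one of the two selection schemes for this sample.
    if rng.random() < 0.5:
        mask = spatial_mask(rng, t, h, w, patch)
    else:
        mask = temporal_mask(rng, t, h, w)
    # True entries take pixels from video_a, False entries from video_b.
    mixed = np.where(mask[..., None], video_a, video_b)
    # ASSUMPTION: label weight equals the fraction of the volume from video_a.
    lam = float(mask.mean())
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label, lam


rng = np.random.default_rng(0)
video_a = np.zeros((8, 16, 16, 3), dtype=np.float32)
video_b = np.ones((8, 16, 16, 3), dtype=np.float32)
label_a = np.array([1.0, 0.0])
label_b = np.array([0.0, 1.0])
mixed, mixed_label, lam = sv_mix_sketch(video_a, video_b, label_a, label_b, rng)
```

In the sketch the masks are sampled independently of the video content. In the method described above, the selection is instead produced by modules trained jointly with the recognition model, so the masks would depend on both input videos.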