The recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often suffer from overfitting on small-scale datasets with a limited number of training videos. A common solution is to apply existing image augmentation strategies, such as Mixup, CutMix, and RandAugment, to each frame individually; however, these strategies are not particularly optimized for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos. SV-Mix devises a learnable selective module that chooses the most informative volumes from two videos and mixes these volumes to produce a new training video. Technically, we propose two new modules: a spatial selective module that selects local patches for each spatial position, and a temporal selective module that mixes entire frames for each timestamp while maintaining the spatial pattern. For each training sample, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of SV-Mix augmentation on a wide range of video action recognition benchmarks, consistently boosting the performance of both CNN-based and transformer-based models.
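The mixing scheme described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the paper's implementation: the paper learns the selection masks jointly with the recognition model, whereas here random binary masks stand in for the learned spatial and temporal selective modules. Videos are assumed to be arrays of shape `(T, H, W, C)`, and all function names are illustrative.

```python
import numpy as np

def spatial_mix(video_a, video_b, patch=4, rng=None):
    """Mix local patches: for each spatial patch position, take the patch
    from one of the two videos. A random mask stands in for the paper's
    learned spatial selective module."""
    rng = np.random.default_rng(rng)
    T, H, W, C = video_a.shape
    # One binary choice per patch position, upsampled to pixel resolution.
    mask = rng.integers(0, 2, size=(H // patch, W // patch)).astype(video_a.dtype)
    mask = np.kron(mask, np.ones((patch, patch), dtype=video_a.dtype))
    mask = mask[None, :, :, None]  # broadcast over time and channels
    return mask * video_a + (1 - mask) * video_b

def temporal_mix(video_a, video_b, rng=None):
    """Mix entire frames per timestamp, preserving each frame's spatial
    pattern. A random per-frame choice stands in for the learned
    temporal selective module."""
    rng = np.random.default_rng(rng)
    T = video_a.shape[0]
    mask = rng.integers(0, 2, size=(T, 1, 1, 1)).astype(video_a.dtype)
    return mask * video_a + (1 - mask) * video_b

def sv_mix(video_a, video_b, rng=None):
    """Randomly pick one of the two selective modules for each sample."""
    rng = np.random.default_rng(rng)
    mix_fn = spatial_mix if rng.random() < 0.5 else temporal_mix
    return mix_fn(video_a, video_b, rng=rng)
```

In the actual method, the binary selection would be produced by the learnable selective module and optimized end-to-end with the action recognition network; the random masks here only convey the structure of the spatial and temporal mixing operations.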