In this paper, we address the challenge of obtaining large-scale unlabelled video datasets for contrastive representation learning in real-world applications. We present a novel video augmentation technique for self-supervised learning, called Cross-Modal Manifold Cutmix (CMMC), which generates augmented samples by combining different modalities in videos. By embedding a video tesseract from one modality into another in the feature space, our method improves the quality of the learned video representations. We perform extensive experiments on two small-scale video datasets, UCF101 and HMDB51, for action recognition and video retrieval. We also show that the approach is effective on the NTU dataset with limited domain knowledge. On both downstream tasks, CMMC achieves performance comparable to other self-supervised methods while using less training data.
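To make the idea of embedding a video tesseract across modalities concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes two feature maps of identical shape `(C, T, H, W)` taken from the same layer of two modality encoders (for example RGB and optical flow; the abstract does not name the modalities). The Beta-distributed mixing coefficient and the random box sampling follow the usual CutMix convention and are assumptions here, as are the hypothetical names `sample_tesseract` and `cross_modal_manifold_cutmix`.

```python
import numpy as np


def sample_tesseract(shape, lam, rng):
    """Pick a random (t, h, w) box covering roughly (1 - lam) of the volume."""
    # Scale each axis by the cube root so the box volume ratio is about (1 - lam).
    ratio = (1.0 - lam) ** (1.0 / 3.0)
    sizes = [max(1, int(round(d * ratio))) for d in shape]
    starts = [int(rng.integers(0, d - s + 1)) for d, s in zip(shape, sizes)]
    return tuple(slice(st, st + s) for st, s in zip(starts, sizes))


def cross_modal_manifold_cutmix(feat_a, feat_b, alpha=1.0, rng=None):
    """Paste a space-time box of feat_b into feat_a (both shaped (C, T, H, W)).

    Returns the mixed feature map and the fraction of feat_a that is kept.
    Hypothetical helper: the sampling scheme is an assumption, not the paper's.
    """
    if feat_a.shape != feat_b.shape:
        raise ValueError("feature maps from both modalities must share a shape")
    rng = np.random.default_rng() if rng is None else rng

    lam = float(rng.beta(alpha, alpha))  # assumed CutMix-style mixing coefficient
    t, h, w = sample_tesseract(feat_a.shape[1:], lam, rng)

    mixed = feat_a.copy()
    mixed[:, t, h, w] = feat_b[:, t, h, w]  # embed the tesseract from modality B

    # Recompute the kept fraction from the actual box, since sizes were rounded.
    box_volume = (t.stop - t.start) * (h.stop - h.start) * (w.stop - w.start)
    lam_eff = 1.0 - box_volume / float(np.prod(feat_a.shape[1:]))
    return mixed, lam_eff


# Usage with random stand-ins for intermediate features of two modalities.
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((4, 8, 7, 7))
flow_feat = rng.standard_normal((4, 8, 7, 7))
mixed_feat, kept = cross_modal_manifold_cutmix(rgb_feat, flow_feat, rng=rng)
```

In a contrastive setup, the mixed feature map would be passed through the remaining layers of the encoder, and the kept fraction could be used to weight the loss between the two source samples; how the paper does this is not stated in the abstract.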