The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and labor-intensive, especially for videos. Moreover, relying on human-generated annotations leads to models that exhibit learned biases, poor domain generalization, and limited robustness. As an alternative, self-supervised learning offers a way to learn representations without annotations and has shown promise in both the image and video domains. Compared with images, learning video representations is more challenging because of the temporal dimension, which introduces motion and other environmental dynamics. This also creates opportunities for video-specific ideas that advance self-supervised learning in the video and multimodal domains. In this survey, we review existing approaches to self-supervised learning, focusing on the video domain. We group these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets and downstream evaluation tasks, and discuss the limitations of existing works and potential future directions in this area.