We address a challenging scenario in unpaired multiview video learning, in which the model aims to learn comprehensive multiview representations while the semantic information varies across views. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this problem. The key idea is to build cross-view pseudo-pairs and perform view-invariant alignment by leveraging the semantic information of videos. To improve the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos, fully leveraging semantic knowledge to improve video representations. Extensive experiments on multiple benchmark datasets verify the effectiveness of our framework. Our method also outperforms multiple existing view-alignment methods in a scenario more challenging than typical paired or unpaired multimodal or multiview learning. Our code is available at https://github.com/wqtwjt1996/SUM-L.
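As a rough illustration of the key idea (semantic pseudo-pairing followed by view-invariant alignment), the sketch below matches each first-person clip to the third-person clip with the most similar text embedding, then applies an InfoNCE-style contrastive loss to the paired video features. This is not the released SUM-L implementation: the function names `build_pseudo_pairs` and `info_nce`, the nearest-neighbour matching rule, the temperature value, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_pseudo_pairs(ego_text, exo_text):
    # Hypothetical pairing rule: for each first-person (ego) clip, choose the
    # third-person (exo) clip whose text embedding has the highest cosine similarity.
    sim = l2_normalize(ego_text) @ l2_normalize(exo_text).T
    return sim.argmax(axis=1)

def info_nce(a, b, temperature=0.07):
    # One-directional InfoNCE: row i of `a` is the positive for row i of `b`;
    # all other rows of `b` act as negatives.
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Random stand-ins for encoder outputs (assumed shapes: 4 clips per view).
rng = np.random.default_rng(0)
ego_text = rng.normal(size=(4, 16))
# Exo captions are shuffled, slightly perturbed copies of the ego captions,
# so the semantically correct match is known for this toy example.
exo_text = ego_text[[2, 0, 3, 1]] + 0.01 * rng.normal(size=(4, 16))
ego_video = rng.normal(size=(4, 32))
exo_video = rng.normal(size=(4, 32))

pairs = build_pseudo_pairs(ego_text, exo_text)
loss = info_nce(ego_video, exo_video[pairs])
print(pairs, loss)
```

In this toy setup the pseudo-pairs recover the shuffled correspondence between the two views. A practical system would also need to handle several first-person clips mapping to the same third-person clip, which this sketch ignores.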