We introduce S$^2$VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so that they transfer well to target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and to address multiple retrieval and detection tasks at once, without any labeled data. This is achieved by learning via instance discrimination with task-tailored augmentations, using the widely used InfoNCE loss together with an additional loss that operates jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined at varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at: https://github.com/gkordo/s2vs
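To make the instance-discrimination objective concrete, the following is a minimal sketch of a generic InfoNCE loss over a batch of embedding pairs. It is not the released S$^2$VS implementation: the function names, the plain-list embeddings, and the temperature value of 0.07 are assumptions chosen for illustration. The additional loss on self-similarity and hard-negative similarity and the task-tailored augmentations are not specified in the abstract, so they are deliberately left out.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch (illustrative sketch).

    positives[i] is assumed to be the embedding of an augmented view of the
    same video as anchors[i]; every other positives[j] acts as a negative.
    The temperature default is an assumption, not a value from the paper.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        # Similarity of anchor i to every candidate, scaled by the temperature.
        logits = [
            cosine_similarity(anchors[i], positives[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair (index i) as the target class.
        total += log_sum_exp - logits[i]
    return total / n


# Toy embeddings: each anchor matches the positive at the same index.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each anchor is most similar to its own augmented view and large when the pairs are swapped, which is the behavior instance discrimination relies on. In practice the embeddings would come from a video encoder and the computation would be batched on a tensor library.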