We introduce S$^2$VS, a self-supervised approach to video similarity learning. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so that they transfer well to target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and to address multiple retrieval and detection tasks at once without any labeled data. This is achieved through instance discrimination with task-tailored augmentations, trained with the widely used InfoNCE loss together with an additional loss that operates jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined at varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at \url{https://github.com/gkordo/s2vs}.
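To make the instance-discrimination objective concrete, the following is a minimal sketch of the InfoNCE term alone, written in plain Python. It is not taken from the released code: the function names `l2_normalize` and `info_nce`, the temperature value of 0.07, and the use of cosine similarity between one embedding vector per video are all illustrative assumptions. The additional loss on self-similarity and hard-negative similarity is not sketched, since its form is not given here.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE over a batch: the i-th anchor should match the i-th positive.

    anchors, positives: equal-length lists of embedding vectors, where
    positives[i] is assumed to be an augmented view of the same video as
    anchors[i]. Every other positive in the batch acts as a negative.
    The temperature default is a hypothetical value, not from the source.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Temperature-scaled cosine similarity of anchor i to every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching index i as the target class.
        total += log_denom - logits[i]
    return total / len(a)


# Toy batch: two "videos", each with two augmented views.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]
aligned = info_nce(views_a, views_b)        # views paired correctly
swapped = info_nce(views_a, views_b[::-1])  # views paired with the wrong video
```

In this toy batch the loss is lower when each view is paired with the view of the same video (`aligned`) than when the pairing is swapped (`swapped`), which is the behavior instance discrimination relies on.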