Video retrieval is becoming increasingly important owing to the rapid growth of video content on the Internet. The dominant paradigm learns video-text representations by forcing the similarity of positive pairs to exceed that of negative pairs by a fixed margin. However, the negative pairs used for training are sampled randomly, so a negative pair may be semantically related or even equivalent, yet most methods still push its representations apart to decrease their similarity. This leads to inaccurate supervision and degrades the learned video-text representations. While most video retrieval methods overlook this phenomenon, we propose an adaptive margin that changes with the distance between positive and negative pairs. First, we design a framework for computing the adaptive margin, including the distance measure and the function that maps distance to margin. Then, we explore a novel implementation called "Cross-Modal Generalized Self-Distillation" (CMGSD), which can be built on top of most video retrieval models with few modifications. Notably, CMGSD adds little computational overhead at training time and none at test time. Experimental results on three widely used datasets demonstrate that the proposed method yields significantly better performance than the corresponding backbone model and outperforms state-of-the-art methods by a large margin.
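To make the contrast between a fixed and an adaptive margin concrete, the following is a minimal sketch, not the CMGSD implementation. It assumes cosine similarity, a hinge-style ranking loss, and a hypothetical linear mapping from distance to margin, where the distance is taken as one minus the cosine similarity between the positive and negative text embeddings. The abstract does not specify the distance measure or the mapping function, so the names `adaptive_margin`, `ranking_loss`, and `base_margin` are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def adaptive_margin(pos_text, neg_text, base_margin=0.2):
    """Hypothetical mapping: the margin grows linearly with the distance
    between the positive and the negative text (assumption, not the
    function used in the paper)."""
    distance = 1.0 - cosine(pos_text, neg_text)
    distance = min(max(distance, 0.0), 1.0)  # clip to [0, 1]
    return base_margin * distance


def ranking_loss(video, pos_text, neg_text, margin):
    """Hinge ranking loss: the positive similarity should exceed the
    negative similarity by at least `margin`."""
    return max(0.0, margin + cosine(video, neg_text) - cosine(video, pos_text))


# A randomly sampled "negative" that is semantically equivalent to the positive.
video = [1.0, 0.0]
pos_text = [1.0, 0.0]
neg_text = [1.0, 0.0]

fixed = ranking_loss(video, pos_text, neg_text, margin=0.2)
adaptive = ranking_loss(video, pos_text, neg_text,
                        margin=adaptive_margin(pos_text, neg_text))
print(fixed, adaptive)
```

With a fixed margin the loss stays positive even though the negative caption matches the video as well as the positive one does, which is the inaccurate supervision described above. Under the assumed mapping, the adaptive margin shrinks to zero for that pair and the penalty disappears.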