Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

With the explosive growth of web videos in recent years, large-scale Content-Based Video Retrieval (CBVR) has become increasingly essential in video filtering, recommendation, and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end times of similar segments at a finer granularity, which benefits user browsing efficiency and infringement detection, especially in long-video scenarios. The challenge of the S-CBVR task is to achieve high temporal alignment accuracy with efficient computation and low storage consumption. In this paper, we propose a Segment Similarity and Alignment Network (SSAN) to address this challenge; it is the first network trained end-to-end for S-CBVR. SSAN is based on two modules newly proposed for video retrieval: (1) an efficient Self-supervised Keyframe Extraction (SKE) module that reduces redundant frame features, and (2) a robust Similarity Pattern Detection (SPD) module for temporal alignment. Compared with uniform frame extraction, SKE saves feature storage and search time while achieving comparable accuracy at the cost of limited extra computation time. For temporal alignment, SPD localizes similar segments with higher accuracy and efficiency than existing deep learning methods. Furthermore, we jointly train SKE and SPD within SSAN and achieve an end-to-end improvement. The two key modules, SKE and SPD, can also be inserted into other video retrieval pipelines, where they yield considerable performance improvements. Experimental results on public datasets show that SSAN obtains higher alignment accuracy than existing methods while reducing storage and online query computational cost.
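
To make the two ideas in the abstract concrete, the following is a minimal sketch of the general setting only, not the SKE or SPD modules, whose design the abstract does not describe. It assumes frame features are already available as NumPy arrays. The function names, the greedy similarity threshold, and the diagonal-scoring rule are all hypothetical stand-ins: `select_keyframes` drops frames that are nearly identical to the last kept frame, and `best_offset` scores each diagonal of the frame-to-frame similarity matrix to estimate a temporal offset.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def select_keyframes(features, threshold=0.9):
    """Greedy heuristic (an assumption, not SKE): keep a frame only if its
    cosine similarity to the most recently kept frame is below `threshold`."""
    feats = l2_normalize(features)
    kept = [0]
    for i in range(1, len(feats)):
        if feats[i] @ feats[kept[-1]] < threshold:
            kept.append(i)
    return kept

def best_offset(query, reference):
    """Naive alignment (an assumption, not SPD): return the diagonal of the
    frame-to-frame similarity matrix with the highest mean similarity."""
    sim = l2_normalize(query) @ l2_normalize(reference).T  # shape (Q, R)
    offsets = range(-(sim.shape[0] - 1), sim.shape[1])
    scores = {k: np.diagonal(sim, offset=k).mean() for k in offsets}
    return max(scores, key=scores.get)

# Synthetic data standing in for real frame features.
rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 64))             # 20 frames, 64-dim features
query = reference[5:12]                           # segment copied from frame 5 onward
offset = best_offset(query, reference)            # query frame i ~ reference frame i + offset
duplicated = np.repeat(reference[:4], 3, axis=0)  # every frame repeated three times
keyframes = select_keyframes(duplicated)          # indices of the frames that are kept
```

In this toy setup the copied segment appears as a diagonal of high similarity starting at reference frame 5, and the repeated frames collapse to one keyframe per distinct frame. Real videos involve edits, speed changes, and noise, which is where learned modules of the kind the abstract proposes would replace these heuristics.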
