The rapid growth of online video resources has significantly promoted the development of video retrieval methods. As a standard evaluation metric for video retrieval, Average Precision (AP) assesses the overall ranking of relevant videos at the top of the retrieval list, making the predicted scores a reliable reference for users. However, recent video retrieval methods rely on pair-wise losses that treat all sample pairs equally, leading to an evident gap between the training objective and the evaluation metric. To effectively bridge this gap, in this work we address two primary challenges: a) the current similarity measures and AP-based losses are suboptimal for video retrieval; b) the noticeable noise from frame-to-frame matching introduces ambiguity into the estimation of the AP loss. In response to these challenges, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). For the former challenge, we develop the TopK-Chamfer Similarity and the QuadLinear-AP loss to measure and optimize video-level similarities in terms of AP. For the latter challenge, we constrain the frame-level similarities to achieve an accurate AP loss estimation. Experimental results show that HAP-VR outperforms existing methods on several benchmark datasets, providing a feasible solution for video retrieval tasks and offering potential benefits for multimedia applications.
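To make the video-level similarity concrete, below is a minimal sketch of what a TopK-Chamfer-style similarity could look like. The abstract does not specify the exact formulation, so the function name, the use of cosine similarity over L2-normalized frame embeddings, and the aggregation (average the top-k frame matches per query frame instead of only the single best match, then average over query frames) are all illustrative assumptions, not the paper's definitive implementation:

```python
import numpy as np

def topk_chamfer_similarity(q_feats: np.ndarray, t_feats: np.ndarray, k: int = 3) -> float:
    """Illustrative TopK-Chamfer-style video similarity (assumed formulation).

    q_feats: (m, d) L2-normalized frame embeddings of the query video.
    t_feats: (n, d) L2-normalized frame embeddings of the target video.
    """
    # Frame-to-frame cosine similarity matrix, shape (m, n).
    sim = q_feats @ t_feats.T
    k = min(k, sim.shape[1])
    # Plain Chamfer similarity keeps only the best match per query frame;
    # averaging the top-k matches instead softens the effect of noisy
    # frame-to-frame correspondences.
    topk = np.sort(sim, axis=1)[:, -k:]
    # Aggregate over query frames to obtain a single video-level score.
    return float(topk.mean())
```

With k=1 this reduces to the standard Chamfer similarity (mean of per-frame best matches); larger k trades sharpness for robustness to spurious single-frame matches.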