Ad-hoc video search (AVS) has two mainstream approaches: aligning a user query with video clips in a cross-modal latent space, and aligning them through semantic concepts. However, the effectiveness of existing approaches is bottlenecked by the small size of available video-text datasets and the low quality of concept banks, which lead to failures on unseen queries and to the out-of-vocabulary problem. This paper addresses these two problems by constructing a new dataset and developing a multi-word concept bank. Specifically, capitalizing on a generative model, we construct a new dataset of 7 million generated text-video pairs for pre-training. To tackle the out-of-vocabulary problem, we develop a multi-word concept bank based on syntax analysis, which enhances the ability of a state-of-the-art interpretable AVS method to model relationships between query words. We also study the impact of current advanced features on the method. Experimental results show that integrating these elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by margins ranging from 2% to 77%, with an average of about 20%.