Video-Text Retrieval (VTR) aims to find the video most relevant to the semantics of a given sentence, and vice versa. In general, the task comprises four successive steps: video feature extraction, text feature extraction, feature embedding and matching, and the design of objective functions. Finally, the samples retrieved from the dataset are ranked by their matching similarity to the query. Deep learning techniques have driven significant progress in recent years; however, VTR remains challenging because of open problems such as how to learn efficient spatio-temporal video features and how to narrow the cross-modal gap. In this survey, we review and summarize over 100 research papers related to VTR, report state-of-the-art performance on several commonly used benchmark datasets, and discuss potential challenges and directions, in the hope of providing insights for researchers in the field of video-text retrieval.
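The final ranking step described above can be illustrated with a minimal sketch. It assumes that the query sentence and each candidate video have already been mapped into a shared embedding space, and it uses cosine similarity as the matching score; both are assumptions made for illustration, since the abstract does not fix a particular similarity measure. The function names `cosine_similarity` and `rank_candidates` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return (candidate_id, score) pairs sorted from most to least similar.

    candidate_embeddings maps a candidate id to its embedding vector.
    Cosine similarity is an illustrative choice, not one prescribed by the survey.
    """
    scored = [
        (candidate_id, cosine_similarity(query_embedding, embedding))
        for candidate_id, embedding in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy text-to-video example with made-up 2-D embeddings.
text_query = [1.0, 0.0]
videos = {
    "video_a": [0.9, 0.1],
    "video_b": [0.0, 1.0],
    "video_c": [-1.0, 0.0],
}
ranking = rank_candidates(text_query, videos)
```

The same function covers the reverse direction (video-to-text) by passing a video embedding as the query and sentence embeddings as the candidates.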