Most existing cross-modal language-to-video retrieval (VR) research relies on a single modality from the video, i.e., its visual representation, even though text is omnipresent in human environments and is frequently critical to understanding a video. To study how to retrieve videos using both modalities, i.e., visual and text semantic representations, we first introduce TextVR, a large-scale cross-modal video retrieval dataset with text reading comprehension. TextVR contains 42.2k sentence queries for 10.5k videos across 8 scenario domains: Street View (indoor), Street View (outdoor), Games, Sports, Driving, Activity, TV Show, and Cooking. TextVR requires a single unified cross-modal model to recognize and comprehend text, relate it to the visual context, and decide which text semantic information is vital for the video retrieval task. In addition, we present a detailed analysis of TextVR in comparison with existing datasets and design a novel multimodal video retrieval baseline for the text-based video retrieval task. The dataset analysis and extensive experiments show that the TextVR benchmark offers many new technical challenges and insights, relative to previous datasets, for the video-and-language community. The project website and GitHub repo can be found at https://sites.google.com/view/loveucvpr23/guest-track and https://github.com/callsys/TextVR, respectively.