We introduce a new task, video corpus visual answer localization (VCVAL), which aims to locate the visual answer to a natural language question within a large collection of untrimmed instructional videos. The task draws on several skills: vision-language interaction, video retrieval, passage comprehension, and visual answer localization. We propose a cross-modal contrastive global-span (CCGS) method for VCVAL that jointly trains the video corpus retrieval and visual answer localization subtasks with a global-span matrix. We also reconstruct a dataset, MedVidCQA, on which the VCVAL task is benchmarked. Experimental results show that the proposed method outperforms other competitive methods on both the video corpus retrieval and visual answer localization subtasks. We further provide detailed analyses of extensive experiments, which offer a starting point for further research on understanding instructional videos.