Solving video-and-language grounding tasks requires the network to understand the connection between the two modalities. For a video and its language description, the semantic relation is reflected in the similarity of their encodings. A good multi-modal encoder should capture the semantics of both inputs and encode them in a shared feature space where embedding distance reflects semantic similarity. In this work, we focus on this semantic connection between video and language and develop a multi-level alignment training scheme that directly shapes the encoding process. We design video-language alignment pairs at the global and segment levels, based on information similarity ranging from high-level context to fine-grained semantics. A contrastive loss contrasts the encoding similarities of positive and negative alignment pairs, so that the network learns to encode similar information close together in the shared feature space while keeping semantically different information apart. Our multi-level alignment training can be applied to various video-and-language grounding tasks. Combined with the task-specific training loss, our framework achieves performance comparable to previous state-of-the-art methods on multiple video QA and retrieval datasets.
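To make the alignment objective concrete, the following is a minimal sketch of a contrastive loss over video-language pairs, combined across the global and segment levels. The abstract does not specify the exact form of the loss, so everything here is an assumption made for illustration: the symmetric InfoNCE-style formulation, the use of in-batch negatives, the temperature value, the function names `contrastive_loss` and `multi_level_loss`, and the weights `w_global` and `w_segment`.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not the paper's stated one).

    video_emb, text_emb: arrays of shape (B, D). Row i of each array forms a
    positive alignment pair; all other rows in the batch act as negatives.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    sim = v @ t.T / temperature            # (B, B) scaled cosine similarities
    idx = np.arange(sim.shape[0])          # positives lie on the diagonal
    video_to_text = -log_softmax(sim, axis=1)[idx, idx].mean()
    text_to_video = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (video_to_text + text_to_video)


def multi_level_loss(global_pair, segment_pair, task_loss=0.0,
                     w_global=1.0, w_segment=1.0):
    """Add global- and segment-level alignment terms to a task-specific loss.

    global_pair:  (video_emb, text_emb) for whole videos and full descriptions.
    segment_pair: (segment_emb, phrase_emb) for aligned clips and sentences.
    The weights are hypothetical hyperparameters.
    """
    loss_global = contrastive_loss(*global_pair)
    loss_segment = contrastive_loss(*segment_pair)
    return task_loss + w_global * loss_global + w_segment * loss_segment
```

Under these assumptions, correctly aligned pairs yield a loss near zero, while mismatched pairs are penalized heavily. This is the behavior the abstract describes: similar information is pulled together in the shared feature space and information with different semantics is pushed apart.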