Cross-modal representation learning has become the norm for bridging the semantic gap between text and visual data. Learning modality-agnostic representations in a continuous latent space, however, is often treated as a black-box, data-driven training process. It is well known that the effectiveness of representation learning depends heavily on the quality and scale of the training data. For video representation learning, obtaining a complete set of labels that annotate the full spectrum of video content is extremely difficult, if not impossible. These two issues, black-box training and dataset bias, make representation learning difficult to deploy in practice for video understanding, because its results are unexplainable and unpredictable. In this paper, we propose two novel training objectives, likelihood and unlikelihood functions, to unroll the semantics behind embeddings while addressing the label sparsity problem in training. Likelihood training aims to interpret the semantics of embeddings beyond the training labels, while unlikelihood training leverages prior knowledge for regularization to ensure semantically coherent interpretation. With both training objectives, we propose a new encoder-decoder network that learns interpretable cross-modal representations for ad-hoc video search. Extensive experiments on the TRECVid and MSR-VTT datasets show that the proposed network outperforms several state-of-the-art retrieval models by a statistically significant margin.
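To make the two objectives concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes the decoder outputs one probability per concept, that the training labels are given as a list of concept indices (`positive_ids`), and that the prior knowledge is given as a list of concept indices that should not be predicted for the sample (`excluded_ids`). All function names, the `weight` parameter, and the averaging scheme are hypothetical choices made for illustration.

```python
import math

EPS = 1e-12  # numerical guard against log(0); an illustrative choice


def likelihood_loss(probs, positive_ids):
    """Negative log-likelihood of the concepts labeled as present.

    Assumption: `probs[i]` is the decoder's probability for concept i.
    """
    if not positive_ids:
        return 0.0
    total = sum(math.log(probs[i] + EPS) for i in positive_ids)
    return -total / len(positive_ids)


def unlikelihood_loss(probs, excluded_ids):
    """Penalty on probability mass given to concepts that prior knowledge rules out.

    Assumption: `excluded_ids` encodes the prior knowledge as concept indices.
    """
    if not excluded_ids:
        return 0.0
    total = sum(math.log(1.0 - probs[i] + EPS) for i in excluded_ids)
    return -total / len(excluded_ids)


def combined_loss(probs, positive_ids, excluded_ids, weight=1.0):
    """Weighted sum of both terms; `weight` is a hypothetical hyperparameter."""
    return (likelihood_loss(probs, positive_ids)
            + weight * unlikelihood_loss(probs, excluded_ids))
```

In this sketch, concepts that are neither labeled nor excluded contribute no loss, so the decoder is not penalized for predicting plausible concepts that the sparse labels happen to omit. That is one simple way to read the abstract's "beyond training labels"; the actual network may handle unlabeled concepts differently.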