Cross-modal representation learning has become the norm for bridging the semantic gap between text and visual data. Learning modality-agnostic representations in a continuous latent space, however, is often treated as a black-box, data-driven training process. It is well known that the effectiveness of representation learning depends heavily on the quality and scale of the training data. For video representation learning, obtaining a complete set of labels that annotates the full spectrum of video content is extremely difficult, if not impossible. These two issues, black-box training and dataset bias, make representation learning hard to deploy for video understanding in practice, because its results are unexplainable and unpredictable. In this paper, we propose two novel training objectives, likelihood and unlikelihood functions, to unroll the semantics behind embeddings while addressing the label sparsity problem in training. Likelihood training aims to interpret the semantics of embeddings beyond the training labels, while unlikelihood training leverages prior knowledge as regularization to ensure semantically coherent interpretation. Using both training objectives, we propose a new encoder-decoder network that learns interpretable cross-modal representations for ad-hoc video search. Extensive experiments on the TRECVid and MSR-VTT datasets show that the proposed network outperforms several state-of-the-art retrieval models by a statistically significant margin.
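As an illustration only, the interplay of the two objectives might be sketched as below. This is not the paper's formulation: it assumes a decoder that outputs an independent probability per concept, a likelihood term that is the negative log-likelihood of the labeled concepts, and an unlikelihood term that penalizes probability assigned to concepts that prior knowledge marks as contradicting a label. The concept names, the `exclusive` table, and both function names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def likelihood_loss(probs, positives):
    """Mean negative log-likelihood of the labeled (positive) concepts."""
    return -sum(math.log(max(probs[c], EPS)) for c in positives) / len(positives)


def unlikelihood_loss(probs, negatives):
    """Mean penalty on probability given to concepts assumed to contradict the labels."""
    if not negatives:
        return 0.0
    return -sum(math.log(max(1.0 - probs[c], EPS)) for c in negatives) / len(negatives)


# Hypothetical decoder output: one probability per concept for a single video.
probs = {"dog": 0.9, "indoors": 0.7, "outdoors": 0.6, "cat": 0.2}

# Sparse training labels: only some of the concepts present are annotated.
positives = ["dog", "indoors"]

# Assumed prior knowledge: concepts that cannot co-occur with a labeled concept.
exclusive = {"indoors": ["outdoors"]}
negatives = [n for c in positives for n in exclusive.get(c, [])]

# Unlabeled concepts such as "cat" are left unpenalized, so the decoder
# remains free to predict concepts beyond the training labels.
loss = likelihood_loss(probs, positives) + unlikelihood_loss(probs, negatives)
```

In this sketch, only concepts that contradict a label are pushed down, while concepts that are merely unlabeled are left alone; that is one plausible way to reconcile interpretation beyond the training labels with semantically coherent output.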