VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Video paragraph captioning aims to generate a coherent multi-sentence description of an untrimmed video that contains several temporally localized events. Humans understand a scene by decomposing it into visual components (e.g., humans, animals) and non-visual components (e.g., actions, relations) under the mutual influence of vision and language. Following this perception process, we first propose a visual-linguistic (VL) feature in which the scene is modeled by three modalities: (i) a global visual environment, (ii) local visual main agents, and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to ensure that the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in both accuracy and diversity. Source code is publicly available at: https://github.com/UARK-AICV/VLTinT.
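To make the idea of matching embedding features to caption semantics concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) objective over paired event and caption embeddings. It is not the VL contrastive loss as defined in the paper, whose exact formulation is not given in the abstract; the function name `contrastive_loss`, the `temperature` value, and the plain-list embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(event_embs, caption_embs, temperature=0.07):
    """Symmetric InfoNCE over matched (event, caption) pairs.

    Hypothetical helper: row i of event_embs is assumed to be paired with
    row i of caption_embs; every other row in the batch acts as a negative.
    """
    if not event_embs or len(event_embs) != len(caption_embs):
        raise ValueError("need equally sized, non-empty batches")
    ev = [l2_normalize(e) for e in event_embs]
    cp = [l2_normalize(c) for c in caption_embs]
    n = len(ev)
    # Cosine similarity between every event and every caption, scaled.
    logits = [
        [sum(a * b for a, b in zip(ev[i], cp[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Event-to-caption direction: the correct caption for row i is column i.
    e2c = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Caption-to-event direction: same matrix, read column-wise.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    c2e = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (e2c + c2e)


if __name__ == "__main__":
    events = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned:", contrastive_loss(events, aligned))
    print("swapped:", contrastive_loss(events, swapped))
```

In this sketch the loss is small when each event embedding is closest to its own caption embedding and large when the pairing is scrambled, which is the behavior a contrastive objective of this kind is meant to reward.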

