Subject, predicate, and object are the most critical components of a sentence and therefore deserve special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), that models the three parts separately and lets them complement each other for better representation. Specifically, COST consists of three transformer branches that exploit visual-linguistic interactions of different granularities in the spatial-temporal domain: between videos and text, between detected objects and text, and between actions and text. We further propose a cross-granularity attention module that aligns the interactions modeled by the three branches, so that the branches can support each other in exploiting the most discriminative semantic information at different granularities for accurate caption prediction. The whole model is trained in an end-to-end fashion. Extensive experiments on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions, and MSVD, demonstrate that the proposed method performs favorably against state-of-the-art methods.
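The three-branch design with cross-granularity alignment can be illustrated with a minimal NumPy sketch. It is not the COST implementation: the abstract does not specify the module's internals, so the fusion rule below (each branch adds a residual attention over the outputs of the other two branches), the function names `attend` and `cross_granularity_step`, the feature dimension, and the random inputs are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, key, value):
    """Scaled dot-product attention: query (n, d), key/value (m, d) -> (n, d)."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value


def cross_granularity_step(text, video, objects, actions):
    """Hypothetical fusion step; the real module may differ substantially."""
    # One branch per granularity: text tokens attend to that visual stream.
    branch = {
        "video": attend(text, video, video),
        "object": attend(text, objects, objects),
        "action": attend(text, actions, actions),
    }
    fused = {}
    for name, own in branch.items():
        # Stack the outputs of the other two branches along the token axis.
        others = np.concatenate(
            [out for key, out in branch.items() if key != name], axis=0
        )
        # Assumed alignment rule: residual attention over the other branches.
        fused[name] = own + attend(own, others, others)
    return fused


# Toy inputs (assumed sizes): 5 text tokens, 8 frames, 12 objects, 4 action clips.
rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(5, d))
video = rng.normal(size=(8, d))
objects = rng.normal(size=(12, d))
actions = rng.normal(size=(4, d))

fused = cross_granularity_step(text, video, objects, actions)
```

Each entry of `fused` keeps the text-token shape `(5, 16)`, so a caption decoder could consume any of the three aligned streams. A trained model would use learned query, key, and value projections and multiple heads rather than raw features.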