Video captioning aims to describe the content of videos in natural language. Despite significant progress, performance in real-world applications still leaves considerable room for improvement, mainly because of the long-tail words challenge. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. TextKG is a two-stream transformer consisting of an external stream and an internal stream. The external stream is designed to absorb additional knowledge: it models the interactions between that knowledge, e.g., a pre-built knowledge graph, and the built-in information of videos, e.g., salient object regions, speech transcripts, and video captions, to mitigate the long-tail words challenge. The internal stream is designed to exploit the multi-modality information in videos (e.g., the appearance of video frames, speech transcripts, and video captions) to ensure the quality of the captions. In addition, a cross-attention mechanism is used between the two streams to share information, so that the two streams can help each other produce more accurate results. Extensive experiments on four challenging video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSRVTT, and MSVD, demonstrate that the proposed method performs favorably against state-of-the-art methods. Specifically, TextKG outperforms the best published results on the YouCookII dataset by an absolute 18.7% in CIDEr score.
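The information sharing between the two streams can be illustrated with a minimal sketch. It is not the TextKG implementation: the abstract does not specify layer counts, projections, or how the attended context is merged. The sketch assumes single-head scaled dot-product attention without learned projections and a plain residual addition; the function names (`cross_attention`, `exchange`) and the toy token vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned query/key/value projections; the tokens of the
    other stream serve as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def exchange(internal, external):
    """Let each stream attend to the other, then add the result residually.

    Assumption: the residual addition is an illustrative choice, not a
    detail taken from the source.
    """
    internal_ctx = cross_attention(internal, external)
    external_ctx = cross_attention(external, internal)
    new_internal = [
        [a + b for a, b in zip(tok, ctx)] for tok, ctx in zip(internal, internal_ctx)
    ]
    new_external = [
        [a + b for a, b in zip(tok, ctx)] for tok, ctx in zip(external, external_ctx)
    ]
    return new_internal, new_external


# Toy inputs: three internal-stream tokens (e.g., frame, speech, caption
# features) and two external-stream tokens (e.g., knowledge graph entries).
internal = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
external = [[0.5, 0.5], [1.0, -1.0]]
new_internal, new_external = exchange(internal, external)
```

Each stream keeps its own sequence length after the exchange, so the two streams can continue to be processed separately while carrying information from each other.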