Video captioning is a challenging task because it must accurately translate visual understanding into natural-language description. To date, state-of-the-art methods inadequately model global-local representations across video frames for caption generation, leaving plenty of room for improvement. In this work, we approach video captioning from a new perspective and propose GL-RG, a \textbf{G}lobal-\textbf{L}ocal \textbf{R}epresentation \textbf{G}ranularity framework. GL-RG offers three advantages over prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder that produces a rich semantic vocabulary, yielding a descriptive granularity of video contents across frames; 3) we develop an incremental training strategy that organizes model learning in an incremental fashion to induce optimal captioning behavior. Experimental results on the challenging MSR-VTT and MSVD datasets show that GL-RG outperforms recent state-of-the-art methods by a significant margin. Code is available at \url{https://github.com/ylqi/GL-RG}.