Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about the progression of events, causality, and even the function of certain objects within a scene. To address this limitation, we propose a novel Transformer-based video captioning model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of the produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of the task being performed. We formalize the task using the ALFRED dataset [54], generated using an AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in that it exhibits stronger object persistence, as well as spatially-aware and causal sentence structure. We show that our commonsense-knowledge-enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as a state-of-the-art result on more traditional video captioning on the ActivityNet Captions dataset [29].