In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which current annotated datasets do not provide. We show that unlabeled narrated videos can be leveraged for dense video captioning by treating the sentence boundaries of transcribed speech as pseudo event boundaries and the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model, pretrained on the YT-Temporal-1B dataset, improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the video paragraph captioning task and to the standard task of video clip captioning. Our code and models will be publicly released at https://antoyang.github.io/vid2seq.html.
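To make the pseudo-labeling idea concrete, the following is a minimal sketch of how timestamped speech sentences could be turned into a single target sequence that interleaves time tokens and text. It is an illustration under stated assumptions, not the released implementation: the names `SpeechSentence`, `time_token` and `build_target_sequence`, the `<time=k>` token format, the choice of 100 time bins, and the start/end/caption ordering are all hypothetical and are not specified in the abstract.

```python
from dataclasses import dataclass


@dataclass
class SpeechSentence:
    """One transcribed speech sentence with start/end times in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


def time_token(t: float, duration: float, num_bins: int = 100) -> str:
    """Quantize a timestamp into one of `num_bins` discrete time tokens.

    Assumption: time is quantized relative to the video duration; the
    abstract only states that special time tokens are used.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    index = int(num_bins * t / duration)
    # Clamp so that t == duration (or slightly out-of-range values) stays valid.
    index = min(max(index, 0), num_bins - 1)
    return f"<time={index}>"


def build_target_sequence(sentences, duration: float, num_bins: int = 100) -> str:
    """Serialize sentences as: start token, end token, caption, in temporal order.

    Sentence boundaries act as pseudo event boundaries and the sentence
    text acts as the pseudo event caption.
    """
    parts = []
    for s in sorted(sentences, key=lambda s: s.start):
        parts.append(time_token(s.start, duration, num_bins))
        parts.append(time_token(s.end, duration, num_bins))
        parts.append(s.text.strip())
    return " ".join(parts)


if __name__ == "__main__":
    # Toy transcript for a 60-second video (made-up example data).
    transcript = [
        SpeechSentence(12.0, 30.0, "whisk until smooth"),
        SpeechSentence(0.0, 12.0, "crack the eggs"),
    ]
    print(build_target_sequence(transcript, duration=60.0))
```

Emitting boundaries and captions in one string is what lets a single sequence-to-sequence language model handle both localization and captioning, which is the single-stage property the abstract describes.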