Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We use a curriculum learning scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then moving to segment-level descriptions, and concluding with summaries of hour-long videos. Furthermore, we introduce the Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: https://sites.google.com/view/vidrecap
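To make the three hierarchy levels concrete, the following is a minimal sketch of how a recursive captioner could be organized. It is not the released Video ReCap implementation. It assumes that "recursive" means the captions produced at one level are passed, together with the video span, as input to the next level. The names `recursive_captions`, `toy_captioner`, `clip_len`, and `clips_per_segment` are hypothetical, and the chunk sizes are arbitrary illustration values, not settings from the paper.

```python
from typing import Callable, Dict, List, Sequence, Union

# Hypothetical captioner interface (an assumption, not the paper's API):
# it receives the frames of a video span plus the captions generated at the
# level below, and returns one caption for that span.
Captioner = Callable[[Sequence[int], Sequence[str]], str]


def chunk(items: Sequence, size: int) -> List[Sequence]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def recursive_captions(
    frames: Sequence[int],
    captioner: Captioner,
    clip_len: int = 4,            # assumed frames per clip
    clips_per_segment: int = 3,   # assumed clips per segment
) -> Dict[str, Union[List[str], str]]:
    """Generate captions at three levels, feeding each level into the next."""
    # Level 1: clip captions, produced from the video alone.
    clips = chunk(frames, clip_len)
    clip_captions = [captioner(clip, []) for clip in clips]

    # Level 2: segment descriptions, produced from a longer video span
    # together with the clip captions that fall inside it.
    segment_descriptions = []
    for start in range(0, len(clips), clips_per_segment):
        group = clips[start:start + clips_per_segment]
        span = [frame for clip in group for frame in clip]
        lower = clip_captions[start:start + clips_per_segment]
        segment_descriptions.append(captioner(span, lower))

    # Level 3: one summary for the whole video, conditioned on the
    # segment descriptions.
    summary = captioner(frames, segment_descriptions)

    return {
        "clip": clip_captions,
        "segment": segment_descriptions,
        "summary": summary,
    }


def toy_captioner(frames: Sequence[int], prior: Sequence[str]) -> str:
    """Stand-in for a real model: reports only the size of its inputs."""
    return f"{len(frames)} frames, {len(prior)} prior captions"
```

Calling `recursive_captions(list(range(24)), toy_captioner)` yields six clip captions, two segment descriptions, and one summary. In a real system `toy_captioner` would be replaced by a trained video-language model. Under the curriculum described above, such a model would be trained on clip-level captions first, then on segment-level descriptions, and finally on video-level summaries.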