Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We use a curriculum learning scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with summaries of hour-long videos. Furthermore, we introduce the Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: https://sites.google.com/view/vidrecap
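The three hierarchy levels described in the abstract (clip captions, segment descriptions, video summary) can be illustrated with a minimal sketch. It is not the Video ReCap implementation: it assumes that "recursive" means the captions produced at one level are passed as extra input to the next level, and every name here (`recursive_captions`, `toy_captioner`, `clip_len`, `clips_per_segment`) is hypothetical. Frames are stand-in integers, and the toy captioner only reports what it was given.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical captioner interface (an assumption, not the paper's API):
# it receives the frames of a span plus any lower-level captions for that span.
Captioner = Callable[[Sequence[int], Sequence[str]], str]


def chunk(items: list, size: int) -> List[list]:
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def recursive_captions(
    frames: List[int],
    captioner: Captioner,
    clip_len: int = 4,            # illustrative value, not from the paper
    clips_per_segment: int = 3,   # illustrative value, not from the paper
) -> Dict[str, object]:
    # Level 1: clip captions, generated from frames alone.
    clips = chunk(frames, clip_len)
    clip_caps = [captioner(clip, []) for clip in clips]

    # Level 2: segment descriptions, generated from the segment's frames
    # together with the clip captions that fall inside the segment.
    seg_caps = []
    clip_groups = chunk(clips, clips_per_segment)
    cap_groups = chunk(clip_caps, clips_per_segment)
    for clip_group, cap_group in zip(clip_groups, cap_groups):
        seg_frames = [f for clip in clip_group for f in clip]
        seg_caps.append(captioner(seg_frames, cap_group))

    # Level 3: one summary for the whole video, conditioned on segment descriptions.
    summary = captioner(frames, seg_caps)
    return {"clips": clip_caps, "segments": seg_caps, "summary": summary}


def toy_captioner(frames: Sequence[int], prior: Sequence[str]) -> str:
    """Stand-in for a real model: reports the span and how many captions it received."""
    return f"frames {frames[0]}-{frames[-1]} ({len(prior)} prior captions)"


result = recursive_captions(list(range(24)), toy_captioner)
print(result["summary"])
```

The same function serves every level, which mirrors the idea of a single model handling inputs of very different lengths. A curriculum in the order the abstract gives would train on level 1 targets first, then level 2, then level 3.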