Segmenting long videos into chapters lets users quickly navigate to the information that interests them. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is created automatically and scalably from online videos by scraping user-annotated chapters, and hence requires no additional manual annotation. We introduce three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models on these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, substantially improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset. Our dataset, code, and models are publicly available at https://antoyang.github.io/vidchapters.html.
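To make the three task definitions concrete, the following minimal Python sketch shows how a single chaptered video could yield the inputs and targets for each task. It is an illustration only, not the released code: the `Chapter` type, the function names, and the assumption that chapters are given as (start time, title) pairs where each chapter ends where the next begins are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chapter:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    title: str    # user-written chapter title


def chapters_from_starts(marks: List[Tuple[float, str]], duration: float) -> List[Chapter]:
    """Build contiguous chapters from (start_time, title) pairs.

    Assumption (not from the paper): each chapter ends where the next one
    begins, and the last chapter ends at the video duration.
    """
    marks = sorted(marks)
    ends = [start for start, _ in marks[1:]] + [duration]
    return [Chapter(s, e, t) for (s, t), e in zip(marks, ends)]


def generation_target(chapters: List[Chapter]) -> List[Tuple[float, float, str]]:
    """Video chapter generation: predict both the segments and their titles."""
    return [(c.start, c.end, c.title) for c in chapters]


def titling_examples(chapters: List[Chapter]) -> List[Tuple[Tuple[float, float], str]]:
    """Generation given ground-truth boundaries: annotated segment -> title."""
    return [((c.start, c.end), c.title) for c in chapters]


def grounding_examples(chapters: List[Chapter]) -> List[Tuple[str, Tuple[float, float]]]:
    """Video chapter grounding: annotated title -> temporal segment."""
    return [(c.title, (c.start, c.end)) for c in chapters]


if __name__ == "__main__":
    # Made-up timestamps and titles, for illustration only.
    demo = chapters_from_starts(
        [(0.0, "Intro"), (42.0, "Setup"), (180.0, "Demo")], duration=300.0
    )
    for title, span in grounding_examples(demo):
        print(title, span)
```

The two variants reuse the same annotations as the full task and differ only in which half of each (segment, title) pair is given as input.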