We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that focuses on improving the quality of generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs), we generate rich DVC-oriented caption candidates and optimize the corresponding pseudo boundaries under several carefully designed objectives that account for diversity, event-centricity, temporal ordering, and coherence. Moreover, we introduce a novel online boundary refinement strategy that iteratively improves the quality of the pseudo boundaries during training. Comprehensive experiments examine the effectiveness of each proposed component. By leveraging a substantial amount of unlabeled video data, such as HowTo100M, we achieve remarkable advances on standard DVC benchmarks such as YouCook2 and ActivityNet. We outperform the previous state-of-the-art Vid2Seq on a majority of metrics, using just 0.4% of the unlabeled video data that Vid2Seq used for pretraining.