Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment

Dense video captioning, the task of localizing meaningful moments in a video and generating a relevant caption for each, often requires a large, expensive corpus of annotated video segments paired with text. To reduce this annotation cost, we propose ZeroTA, a novel method for zero-shot dense video captioning. Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time by optimizing solely on that input. This is accomplished by introducing a soft moment mask that represents a temporal segment in the video and jointly optimizing it with the prefix parameters of a language model. This joint optimization aligns a frozen language generation model (i.e., GPT-2) with a frozen vision-language contrastive model (i.e., CLIP) by maximizing the matching score between the generated text and a moment within the video. We also introduce a pairwise temporal IoU loss that encourages a set of soft moment masks to capture multiple distinct events within the video. Our method effectively discovers diverse significant events within the video, and the resulting captions appropriately describe these events. Empirical results demonstrate that ZeroTA surpasses zero-shot baselines and even outperforms the state-of-the-art few-shot method on the widely used benchmark ActivityNet Captions. Moreover, our method is more robust than supervised methods when evaluated in out-of-domain scenarios. This research provides insight into the potential of aligning widely used models, such as language generation models and vision-language models, to unlock a new capability: understanding temporal aspects of videos.
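
The abstract does not give the functional form of the soft moment mask or of the pairwise temporal IoU loss. As an illustration only, the sketch below assumes a mask built from the difference of two sigmoids over normalized time, and a soft IoU computed as the ratio of summed per-frame minima to summed per-frame maxima. The function names, the `sharpness` parameter, and both formulas are hypothetical choices made for this example, not the authors' implementation.

```python
import math
from itertools import combinations


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_moment_mask(center, width, num_frames, sharpness=50.0):
    """Per-frame weights in [0, 1] for a segment in normalized time [0, 1].

    Assumed parameterization (hypothetical): a rising sigmoid at the segment
    start minus a rising sigmoid at the segment end, so the mask varies
    smoothly with `center` and `width` and could be optimized by gradient.
    """
    start = center - width / 2.0
    end = center + width / 2.0
    mask = []
    for i in range(num_frames):
        t = (i + 0.5) / num_frames  # midpoint of frame i in normalized time
        mask.append(_sigmoid(sharpness * (t - start)) - _sigmoid(sharpness * (t - end)))
    return mask


def soft_temporal_iou(mask_a, mask_b, eps=1e-8):
    """Soft IoU of two masks: sum of per-frame minima over sum of maxima."""
    inter = sum(min(a, b) for a, b in zip(mask_a, mask_b))
    union = sum(max(a, b) for a, b in zip(mask_a, mask_b))
    return inter / (union + eps)


def pairwise_tiou_loss(masks):
    """Mean soft IoU over all unordered pairs; lower means less overlap."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        return 0.0
    return sum(soft_temporal_iou(a, b) for a, b in pairs) / len(pairs)


# Two well-separated moments overlap far less than two identical ones.
early = soft_moment_mask(center=0.2, width=0.2, num_frames=32)
late = soft_moment_mask(center=0.8, width=0.2, num_frames=32)
loss_separated = pairwise_tiou_loss([early, late])
loss_identical = pairwise_tiou_loss([early, early])
```

Under these assumptions, minimizing `pairwise_tiou_loss` pushes the masks toward different parts of the video, which matches the stated purpose of capturing multiple distinct events. In a full system the mask weights would also modulate the frame features scored against the generated text, a step this sketch leaves out.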
