Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to its paired video clips in a common feature space. For long videos, given a paragraph of description whose sentences describe different segments of the video, matching all sentence-clip pairs aligns the paragraph and the full video only implicitly. However, such unit-level comparison may ignore global temporal context, which in turn limits generalization ability. In this paper, we propose TempCLR, a contrastive learning framework that compares the full video and the paragraph explicitly. Because the video and the paragraph are formulated as sequences of clips and sentences, respectively, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs, under the constraint of their temporal order, as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. The resulting clip and sentence representations perceive temporal information and thus facilitate sequence alignment. In addition to pre-training on videos and paragraphs, our approach also generalizes to matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gains across all three tasks. Detailed ablation studies are provided to justify the design of the approach.
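To make the sequence-level distance concrete, the following is a minimal sketch of dynamic time warping over sentence and clip embeddings. It is an illustration under stated assumptions, not the TempCLR implementation: the abstract does not specify the pairwise cost, the allowed warping steps, or any normalization, so cosine distance and the classic three-step recursion are assumed here, and the names `cosine_distance` and `dtw_distance` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed pairwise cost, not from the paper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(
    sentences: List[Sequence[float]], clips: List[Sequence[float]]
) -> float:
    """Minimum cumulative cost of a monotonic sentence-clip alignment.

    Both inputs are lists of embedding vectors in temporal order.
    """
    n, m = len(sentences), len(clips)
    inf = float("inf")
    # acc[i][j] is the best cumulative cost of aligning the first i
    # sentences with the first j clips.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cosine_distance(sentences[i - 1], clips[j - 1])
            # Classic DTW steps: advance the sentence, the clip, or both.
            acc[i][j] = cost + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]
            )
    return acc[n][m]


# Toy data: two sentences, three clips (the first two clips match sentence 1).
toy_sentences = [[1.0, 0.0], [0.0, 1.0]]
ordered_clips = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shuffled_clips = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

print(dtw_distance(toy_sentences, ordered_clips))
print(dtw_distance(toy_sentences, shuffled_clips))
```

In this toy example, the correctly ordered clips yield a lower distance than the shuffled ones, which is the kind of contrast that shuffling clips is meant to expose during training.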