Sequential video understanding, an emerging video understanding task, has attracted considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We approach this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation, and a pre-trained text encoder to encode both the text corresponding to each action and the text describing the whole video. To model the correspondence between text and video, we propose a multiple granularity loss: a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. Because the frame-sentence correspondence is not available, we exploit the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences, and we supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR
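The multiple granularity loss described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (see the linked repository for that). It makes several assumptions that are not stated in the abstract: pseudo frame-sentence labels are produced by splitting the frames into equal-length, temporally ordered segments (one per sentence), the temperature is fixed at 0.07, and the two losses are combined with a hypothetical weight `fine_weight`. All function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize an embedding (assumes a non-zero vector).
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def video_paragraph_loss(video_embs, paragraph_embs, temperature=0.07):
    # Coarse level: symmetric contrastive loss over a batch in which
    # video i is paired with paragraph i.
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in paragraph_embs]
    n = len(v)
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    v2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    t2v = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (v2t + t2v)


def uniform_pseudo_labels(num_frames, num_sentences):
    # ASSUMPTION: actions occur in script order and take equal time, so
    # frame f is assigned to sentence floor(f * S / F). This is only one
    # simple way to exploit the sequential prior.
    return [f * num_sentences // num_frames for f in range(num_frames)]


def frame_sentence_loss(frame_embs, sentence_embs, temperature=0.07):
    # Fine level: each frame should match its pseudo-labelled sentence
    # better than the other sentences of the same script.
    labels = uniform_pseudo_labels(len(frame_embs), len(sentence_embs))
    s = [normalize(x) for x in sentence_embs]
    total = 0.0
    for frame, label in zip(frame_embs, labels):
        f = normalize(frame)
        logits = [dot(f, sj) / temperature for sj in s]
        total += cross_entropy(logits, label)
    return total / len(frame_embs)


def multi_granularity_loss(video_embs, paragraph_embs, frame_embs,
                           sentence_embs, fine_weight=1.0):
    # fine_weight is a hypothetical balancing coefficient.
    coarse = video_paragraph_loss(video_embs, paragraph_embs)
    fine = frame_sentence_loss(frame_embs, sentence_embs)
    return coarse + fine_weight * fine
```

In this sketch, `frame_sentence_loss` is applied to the frames and sentences of a single video; in practice it would be averaged over the batch, and the embeddings would come from the transformer video encoder and the pre-trained text encoder rather than from hand-written lists.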