Existing video-language studies mainly focus on learning from short video clips, leaving long-term temporal dependencies largely unexplored because of the prohibitive computational cost of modeling long videos. One feasible solution is to learn the correspondence between video clips and captions, which, however, inevitably runs into the multi-granularity noisy correspondence (MNC) problem. Specifically, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), both of which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses based on OT to capture long-term dependencies. To address coarse-grained misalignment in the video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits potentially faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment, which ensures precise temporal modeling. Extensive experiments on video retrieval, video question answering, and action segmentation verify the effectiveness of our method. Code is available at https://lin-yijie.github.io/projects/Norton.
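To make the two granularities more concrete, the following is a minimal NumPy sketch of the ideas named in the abstract, not the released implementation. It assumes cosine-style similarities in [-1, 1], uniform marginals, a log-sum-exp form of the soft-maximum operator, and a constant bucket value. The function names (`soft_max_similarity`, `add_prompt_bucket`, `sinkhorn`) and the parameters `alpha`, `eps`, `n_iters`, and `bucket_value` are hypothetical choices for illustration.

```python
import numpy as np


def soft_max_similarity(frames, words, alpha=10.0):
    """Fine-grained clip-caption similarity (assumed log-sum-exp form).

    frames: (num_frames, dim) array, words: (num_words, dim) array.
    For each frame, a smooth maximum over word similarities is taken,
    then the result is averaged over frames.
    """
    sim = frames @ words.T                      # (num_frames, num_words)
    m = sim.max(axis=1, keepdims=True)          # subtract max for stability
    lse = m[:, 0] + np.log(np.exp(alpha * (sim - m)).sum(axis=1)) / alpha
    return lse.mean()


def add_prompt_bucket(sim, bucket_value):
    """Pad a clip-caption similarity matrix with one extra row and column.

    The extra row/column acts as a bucket that can absorb clips or captions
    whose best match is weaker than bucket_value (an assumed constant here).
    """
    n, m = sim.shape
    padded = np.full((n + 1, m + 1), bucket_value, dtype=float)
    padded[:n, :m] = sim
    return padded


def sinkhorn(sim, eps=0.1, n_iters=100):
    """Entropy-regularised OT plan for a similarity matrix, uniform marginals."""
    n, m = sim.shape
    kernel = np.exp(sim / eps)
    a = np.full(n, 1.0 / n)                     # row marginals
    b = np.full(m, 1.0 / m)                     # column marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]     # transport plan
```

Under these assumptions, calling `sinkhorn(add_prompt_bucket(sim, bucket_value))` yields a plan whose last column holds the mass each clip sends to the bucket; a clip that sends more mass to the bucket than to any caption can be treated as unalignable, while the remaining entries give soft clip-caption alignments that could serve as rectified contrastive targets.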