Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to a given query. Most existing VMR methods focus solely on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works attempt joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. Such designs are impractical in practice: not all audio is helpful for video moment retrieval, and the audio of some videos may be pure noise or background sound that is meaningless for moment determination. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts an importance score for the audio and accordingly assigns weights to mitigate the interference caused by noisy audio. We then design a multi-granularity audio fusion module that adaptively fuses the audio and visual modalities at the local, event, and global levels, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio at inference time. To evaluate our method, we construct a new VMR dataset, Charades-AudioMatter, in which audio-related samples are manually selected and re-organized from the original Charades-STA to validate a model's capability to exploit the audio modality. Extensive experiments validate the effectiveness of our method, which achieves state-of-the-art performance among audio-video fusion VMR methods. Our code is available at https://github.com/HuiGuanLab/IMG.
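To make the importance-aware weighting idea concrete, below is a minimal sketch (not the authors' released code) of how a pseudo-label-supervised importance score could gate the audio stream before audio-visual fusion; the module name, feature dimensions, and pooling choice are illustrative assumptions only.

```python
# Minimal sketch, assuming pre-extracted clip-level video and audio features.
# Names, dimensions, and architecture details are hypothetical, not IMG's actual code.
import torch
import torch.nn as nn

class ImportanceWeightedFusion(nn.Module):
    """Predicts a scalar audio-importance score and uses it to gate
    the audio stream before fusing it with the video stream."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.importance_head = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 1), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, video_feats: torch.Tensor, audio_feats: torch.Tensor):
        # video_feats, audio_feats: (batch, num_clips, dim)
        # Pool audio over time to score the usefulness of the whole track.
        score = self.importance_head(audio_feats.mean(dim=1))   # (batch, 1)
        gated_audio = audio_feats * score.unsqueeze(1)          # down-weight noisy audio
        fused = self.fuse(torch.cat([video_feats, gated_audio], dim=-1))
        return fused, score

# Toy usage; in the paper the score would be supervised by pseudo labels,
# which are omitted here.
model = ImportanceWeightedFusion(dim=512)
v = torch.randn(2, 32, 512)   # 2 videos, 32 clips, 512-d visual features
a = torch.randn(2, 32, 512)   # matching audio features
fused, score = model(v, a)
print(fused.shape, score.shape)  # torch.Size([2, 32, 512]) torch.Size([2, 1])
```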