This paper focuses on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model must find the audio-visual clues most relevant to the given question. We propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and to explore the multi-modal relations (i.e., among the object, audio, and question) in terms of feature interaction and model optimization. For feature interaction, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to focus the audio and visual modalities on their respective keywords in the question, and a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects highly semantically matched multi-modal pairs as positives. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. The selected pairs are constrained to have larger similarity values than the mismatched pairs. The positive-selection process is adaptive because the positive pairs selected in each video frame may differ. These two object-aware objectives help the model understand which objects are relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate that the proposed method is effective in finding favorable audio-visual clues and achieves new state-of-the-art question-answering performance.
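The adaptive-positivity idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formula, so everything below is an assumption made for illustration: the function name `adaptive_positive_contrastive_loss`, the use of cosine similarity, the top-1 selection rule per frame, the InfoNCE-style softmax form, and the `temperature` parameter are all hypothetical and may differ from the authors' actual objectives. The `query` vector stands in for either a question feature or an audio feature, and `objects` holds per-frame object features.

```python
import numpy as np


def adaptive_positive_contrastive_loss(query, objects, temperature=0.1):
    """Hypothetical object-aware contrastive loss (illustration only).

    query:   array of shape (d,), a question or audio feature.
    objects: array of shape (T, K, d), K object features in each of T frames.
    Returns (loss, pos_idx), where pos_idx[t] is the object chosen as the
    positive in frame t.
    """
    # Cosine similarity between the query and every object in every frame.
    q = query / np.linalg.norm(query)
    o = objects / np.linalg.norm(objects, axis=-1, keepdims=True)
    sim = o @ q  # shape (T, K)

    # Adaptive selection (assumed rule): in each frame, the best-matching
    # object is the positive, so the choice can differ from frame to frame.
    pos_idx = sim.argmax(axis=1)  # shape (T,)

    # InfoNCE-style objective: push the positive's similarity above the
    # similarities of the other (mismatched) objects in the same frame.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_prob[np.arange(len(pos_idx)), pos_idx].mean()
    return loss, pos_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    question = rng.normal(size=16)         # placeholder question feature
    frames = rng.normal(size=(4, 5, 16))   # 4 frames, 5 objects per frame
    loss, positives = adaptive_positive_contrastive_loss(question, frames)
    print("loss:", loss, "selected objects per frame:", positives)
```

Under these assumptions, the same function could be called twice, once with a question feature and once with an audio feature, to mirror the two question-object and audio-object objectives the abstract mentions.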