Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering

This paper focuses on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions about untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given question. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and to explore the multi-modal relations (i.e., among the object, audio, and question) in terms of both feature interaction and model optimization. For feature interaction, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to focus the audio and visual modalities on their respective keywords in the question, and a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the most semantically matched multi-modal pairs as positives. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positive-selection process is adaptive because the positive pairs selected in each video frame may differ. These two object-aware objectives help the model understand which objects are actually relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate that the proposed method is effective in finding favorable audio-visual clues and achieves new state-of-the-art question-answering performance.
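
To make the adaptive positive selection concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes details the abstract does not give: cosine similarity as the matching score, the single most similar object per frame as the positive, objects paired with queries from other videos in the batch as the mismatched pairs, and an InfoNCE-style loss with a hypothetical `temperature` parameter. The function name, argument names, and tensor shapes are all illustrative.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def adaptive_positive_contrastive_loss(objects, queries, temperature=0.1):
    """Illustrative object-aware contrastive loss (assumed formulation).

    objects: (B, T, N, D) object features, N objects in each of T frames.
    queries: (B, D) one query feature per video (e.g. a question embedding).
    Returns (scalar loss, positive object index per frame with shape (B, T)).
    """
    obj = l2_normalize(objects)
    qry = l2_normalize(queries)

    # sim[b, c, t, n]: scaled cosine similarity between object n in frame t
    # of video b and the query of video c.
    sim = np.einsum("btnd,cd->bctn", obj, qry) / temperature
    B, _, T, N = sim.shape

    # Matched pairs: each video's objects against its own query.
    idx = np.arange(B)
    matched = sim[idx, idx]                      # (B, T, N)

    # Adaptive positive: the best-matching object may differ in every frame.
    pos_index = matched.argmax(axis=-1)          # (B, T)
    pos = matched.max(axis=-1)                   # (B, T)

    # Assumed negatives: all objects paired with queries of other videos.
    mismatched = ~np.eye(B, dtype=bool)          # (B, B)
    neg = np.where(mismatched[:, :, None, None], sim, -np.inf)
    neg = neg.transpose(0, 2, 1, 3).reshape(B, T, B * N)

    # InfoNCE: -log(exp(pos) / (exp(pos) + sum(exp(neg)))), computed stably.
    logits = np.concatenate([pos[..., None], neg], axis=-1)
    m = logits.max(axis=-1, keepdims=True)
    log_denom = m[..., 0] + np.log(np.exp(logits - m).sum(axis=-1))
    loss = (log_denom - pos).mean()
    return loss, pos_index
```

Under these assumptions, the same function could serve both objectives by changing the query: question embeddings for the question-object loss, and audio features for the audio-object loss. An audio variant would more plausibly use one audio feature per segment rather than one per video, which this sketch does not cover.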
