Audio-visual question answering (AVQA) requires a model to draw on both the visual content and the audio of a video and to relate them to the question in order to predict the most accurate answer. Although mining deeper audio-visual information to interact with the question facilitates multimodal fusion, redundant audio-visual parameters tend to reduce how well the inference engine generalizes across the multiple question-answer pairs attached to a single video. Moreover, the inherently heterogeneous relationship between audio-visual signals and text makes perfect fusion challenging. To prevent high-level audio-visual semantics from weakening the network's adaptability to diverse question types, we propose a framework that performs mutual correlation distillation (MCD) to aid question inference. MCD consists of three main steps: 1) a residual structure is used to enhance the self-attention-based audio-visual soft associations, after which key local audio-visual features relevant to the question context are captured hierarchically by shared aggregators and coupled, in the form of clues, with specific question vectors; 2) knowledge distillation is enforced to align audio-visual-text pairs in a shared latent space, narrowing the cross-modal semantic gap; and 3) the audio-visual dependencies are decoupled by discarding the decision-level integrations. We evaluate the proposed method on two publicly available datasets containing multiple question-answer pairs, namely Music-AVQA and AVQA. Experiments show that our method outperforms other state-of-the-art methods; one interesting finding is that removing deep audio-visual features during inference can effectively mitigate overfitting. The source code is released at http://github.com/rikeilong/MCD-forAVQA.
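As a rough illustration of the distillation step (step 2), the sketch below computes a temperature-softened KL divergence between two sets of scores. It is not the authors' implementation: the abstract does not state which distillation loss MCD uses, so the KL form, the temperature value, and the names `log_softmax`, `distillation_loss`, `teacher_scores`, and `student_scores` are all assumptions made for illustration.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def distillation_loss(teacher_scores, student_scores, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: the two inputs are similarity scores produced by two
    branches (for example, an audio-visual branch and a text branch)
    over the same set of candidates. Which branch plays the teacher is
    a hypothetical choice here, not something the abstract specifies.
    """
    log_p = log_softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # Scaling by T^2 is the usual convention for keeping gradient
    # magnitudes comparable across temperatures.
    return kl * temperature ** 2


if __name__ == "__main__":
    aligned = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
    misaligned = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
    print(f"aligned: {aligned:.6f}, misaligned: {misaligned:.6f}")
```

The loss is zero when the two branches assign identical scores and grows as their softened distributions diverge, which is the sense in which minimizing it pulls the modalities toward a shared space.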