In Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities can be learned on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network is not aligned on the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic information is often not balanced within a context. Both shortcomings result in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses these challenges by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on the Temporal level in a self-supervised setting; and iii) introducing a cross-attention mechanism to balance audio and visual information on the Semantic level. The proposed CAD network improves overall performance over state-of-the-art methods by 9.4% on average on the MUSIC-AVQA dataset. We also demonstrate that our contributions can be added to existing AVQA methods to improve their performance without additional complexity requirements.
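To make the idea of cross-attention between modalities concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in which audio features query visual features and vice versa. It is not the CAD network's actual Semantic-level module, whose details are not given in the abstract: the function names `softmax` and `cross_attention`, the toy feature values, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (no learned projections).

    Each query vector from one modality attends over the key/value
    vectors of the other modality and returns a weighted average of
    the values.
    """
    d = len(keys[0])
    dim_out = len(values[0])
    attended = []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Convex combination of the value vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_out)]
        )
    return attended


# Hypothetical toy features: 2 audio segments and 3 visual segments, 4-dim each.
audio = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.9, 0.3, 0.0]]
visual = [[0.8, 0.1, 0.4, 0.3], [0.0, 1.0, 0.2, 0.1], [0.5, 0.5, 0.5, 0.5]]

audio_attended = cross_attention(audio, visual, visual)  # audio queries visual
visual_attended = cross_attention(visual, audio, audio)  # visual queries audio
```

In this sketch each modality is re-expressed as a weighted average of the other modality's features, which is one plausible way to let the two streams exchange information symmetrically. A trained model would typically add learned query, key, and value projections on top of this.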