Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task that requires intelligent systems to accurately answer natural language queries given audio-video input pairs. However, prevalent AVQA approaches are prone to overfitting to dataset biases, which leads to poor robustness. Furthermore, current datasets may not provide a precise diagnostic of these methods. To tackle these challenges, we first propose a novel dataset, \textit{MUSIC-AVQA-R}, constructed in two steps: rephrasing the questions in the test split of a public dataset (\textit{MUSIC-AVQA}), and then introducing distribution shifts to split the questions. The former yields a large, diverse test space, while the latter enables a comprehensive robustness evaluation on rare, frequent, and overall questions. Second, we propose a robust architecture that uses a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, with a significant improvement of 9.68\% on the proposed dataset. Extensive ablation experiments on both datasets validate the effectiveness of the debiasing strategy. Additionally, evaluation on our dataset highlights the limited robustness of existing multi-modal QA methods.