Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task that requires intelligent systems to answer natural language queries accurately from paired audio and video input. However, prevalent AVQA approaches tend to overfit to dataset biases, which results in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, we first propose a novel dataset, \textit{MUSIC-AVQA-R}, constructed in two steps: rephrasing the questions in the test split of a public dataset (\textit{MUSIC-AVQA}), and then introducing distribution shifts to partition the questions. The former yields a large, diverse test space, while the latter enables a comprehensive robustness evaluation on rare, frequent, and overall questions. Second, we propose a robust architecture that uses a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, with a notable improvement of 9.32\% on the proposed dataset. Extensive ablation experiments on both datasets validate the effectiveness of the debiasing strategy. Additionally, evaluation on our dataset highlights the limited robustness of existing multi-modal QA methods.