In recent years, there has been a growing emphasis on the intersection of audio, vision, and text modalities, driving advances in multimodal research. However, a strong bias in any one modality can lead the model to neglect the others. Consequently, the model's ability to reason effectively across these modalities is compromised, impeding further progress. In this paper, we meticulously review each question type in the original MUSIC-AVQA dataset and select those with pronounced answer biases. To counter these biases, we gather complementary videos and questions, ensuring that no answer has a markedly skewed distribution. In particular, for binary questions, we strive to ensure that the two answers are almost uniformly distributed within each question category. As a result, we construct a new dataset, named MUSIC-AVQA v2.0, which is more challenging and which we believe can better foster progress on the AVQA task. Furthermore, we present a novel baseline model that delves deeper into the audio-visual-text interrelation. On MUSIC-AVQA v2.0, this model outperforms all existing approaches, improving accuracy by 2% and setting a new state of the art.
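As a rough illustration of the bias check described above, the following Python sketch groups question-answer pairs by question type and flags types whose most frequent answer dominates. It is not the procedure used to build the dataset: the function name `answer_skew`, the `(question_type, answer)` input format, the toy samples, and the 0.6 threshold are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def answer_skew(samples, threshold=0.6):
    """Flag question types whose most common answer exceeds `threshold`.

    `samples` is an iterable of (question_type, answer) pairs.
    Returns {question_type: (top_answer, share_of_top_answer)}.
    The threshold is a hypothetical cutoff chosen for illustration.
    """
    # Count how often each answer occurs within each question type.
    by_type = defaultdict(Counter)
    for question_type, answer in samples:
        by_type[question_type][answer] += 1

    flagged = {}
    for question_type, counts in by_type.items():
        top_answer, top_count = counts.most_common(1)[0]
        share = top_count / sum(counts.values())
        if share > threshold:
            flagged[question_type] = (top_answer, share)
    return flagged


# Toy data (hypothetical): a skewed binary type and a balanced one.
samples = (
    [("existential", "yes")] * 9
    + [("existential", "no")] * 1
    + [("counting", "two")] * 5
    + [("counting", "three")] * 5
)

print(answer_skew(samples))
```

In this toy example only the "existential" type is flagged, since 90% of its answers are "yes"; a type flagged this way would be a candidate for collecting complementary videos and questions until the answers are closer to uniform.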