Recent Audio-Visual Question Answering (AVQA) methods rely on complete visual and audio input to answer questions accurately. However, in real-world scenarios, issues such as device malfunctions and data-transmission errors frequently result in a missing audio or visual modality, and existing AVQA methods suffer significant performance degradation in such cases. In this paper, we propose a framework that ensures robust AVQA performance even when a modality is missing. First, we propose a Relation-aware Missing Modal (RMM) generator trained with a Relation-aware Missing Modal Recalling (RMMR) loss, which strengthens the generator's ability to recall missing modal information by exploiting the relationships and context among the available modalities. Second, we design an Audio-Visual Relation-aware (AVR) diffusion model with an Audio-Visual Enhancing (AVE) loss that further enhances audio-visual features by leveraging the relationships and shared cues between the two modalities. As a result, our method provides accurate answers by effectively utilizing the available information even when an input modality is missing. We believe our method has potential applications not only in AVQA research but also in a wide range of multi-modal scenarios.
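The overall missing-modality pipeline above (recall the absent modality from the available one, then fuse the features for answering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`recall_missing`, `fuse`) are hypothetical, and fixed random linear maps stand in for the learned RMM generator.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimension

# Hypothetical toy stand-ins for the learned RMM generator: fixed linear
# maps that "recall" one modality's features from the other's.
W_a2v = rng.standard_normal((D, D)) / np.sqrt(D)
W_v2a = rng.standard_normal((D, D)) / np.sqrt(D)

def recall_missing(audio, visual):
    """Fill in whichever modality is None from the available one."""
    if audio is None:
        audio = W_v2a @ visual
    if visual is None:
        visual = W_a2v @ audio
    return audio, visual

def fuse(audio, visual):
    """Simple concatenation fusion of the (possibly recalled) features."""
    return np.concatenate([audio, visual])

# Example: the visual modality is missing; it is recalled from audio
# before fusion, so downstream answering still sees both modalities.
audio = rng.standard_normal(D)
fused = fuse(*recall_missing(audio, None))
```

In the proposed framework the recalled features would additionally be refined by the AVR diffusion model before answering; that refinement step is omitted from this sketch.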