Although person or identity verification has predominantly been studied using individual modalities such as face and voice, audio-visual fusion has recently shown strong potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always complement each other strongly; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that dynamically selects cross-attended or unattended features on the fly, depending on whether the audio and visual modalities exhibit strong or weak complementary relationships, respectively. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose cross-attended features only when the modalities exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves performance across multiple variants of cross-attention while outperforming state-of-the-art methods.
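The abstract does not give the equations of the conditional gating layer, so the following is only a minimal sketch of how such a gate could work, under stated assumptions: the gate logits are computed from the cross-attended features by a linear map, a low-temperature softmax turns them into two near-binary weights, and the output is the weighted sum of the unattended and cross-attended features. The names `softmax`, `conditional_gate`, `gate_weights`, and `temperature` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a low temperature sharpens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_gate(unattended, attended, gate_weights, temperature=0.1):
    """Blend unattended and cross-attended features with a two-way soft gate.

    unattended, attended: feature vectors of equal length (lists of floats).
    gate_weights: two weight vectors (assumed learned parameters); row 0
        yields the logit for the unattended branch, row 1 the logit for the
        cross-attended branch. This parameterization is an assumption.
    """
    # One logit per branch, computed here from the cross-attended features.
    logits = [sum(w * a for w, a in zip(row, attended)) for row in gate_weights]
    g_unatt, g_att = softmax(logits, temperature)
    # Weighted sum: close to a hard selection when the temperature is low.
    fused = [g_unatt * u + g_att * a for u, a in zip(unattended, attended)]
    return fused, (g_unatt, g_att)


# Toy example with hand-picked (not learned) gate weights.
unattended = [1.0, 2.0]
attended = [3.0, 4.0]
favor_attended = [[-1.0, 0.0], [1.0, 0.0]]
fused, gates = conditional_gate(unattended, attended, favor_attended)
```

With these hand-picked weights the gate places almost all of its mass on the cross-attended branch, so `fused` is nearly equal to `attended`; swapping the two rows of the weights pushes the output back toward `unattended`. In a trained model the weights would be learned, and the soft gate keeps the selection differentiable.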