Although person or identity verification has predominantly been explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always strongly complement each other; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that dynamically selects the cross-attended or unattended features on the fly, depending on whether the audio and visual modalities exhibit strong or weak complementary relationships, respectively. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose cross-attended features only when the modalities exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves the performance of multiple variants of cross-attention while outperforming state-of-the-art methods.
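To make the gating idea concrete, the following is a minimal NumPy sketch of how a conditional gating layer might choose between cross-attended and unattended features. The abstract does not give the formulation, so everything here is an assumption for illustration: the function names `cross_attend` and `dynamic_gate`, the projection matrix `w_gate`, the low-temperature softmax used as a soft selector, and the random inputs standing in for real audio and visual embeddings.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_feats, context_feats):
    # Scaled dot-product cross-attention: each time step of one modality
    # attends over all time steps of the other modality.
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (T_query, T_context)
    return softmax(scores, axis=-1) @ context_feats      # (T_query, d)


def dynamic_gate(unattended, attended, w_gate, temperature=0.1):
    # Hypothetical gating layer (an assumption, not the paper's exact design):
    # project the attended features to two logits per time step, then use a
    # low-temperature softmax so the gate behaves close to a hard selection.
    logits = attended @ w_gate                            # (T, 2)
    gate = softmax(logits / temperature, axis=-1)         # rows sum to 1
    # Column 0 weights the unattended features, column 1 the attended ones.
    out = gate[:, :1] * unattended + gate[:, 1:] * attended
    return out, gate


# Random placeholders standing in for per-frame audio and visual embeddings.
rng = np.random.default_rng(0)
T, d = 6, 8
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))
w_gate_audio = rng.standard_normal((d, 2))  # would be learned in practice

audio_attended = cross_attend(audio, visual)
audio_out, audio_gate = dynamic_gate(audio, audio_attended, w_gate_audio)
```

In this sketch, a gate row close to `[0, 1]` keeps the cross-attended feature for that time step, while a row close to `[1, 0]` falls back to the unattended feature. The same gating would be applied symmetrically to the visual stream with its own projection.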