Although person or identity verification has been predominantly explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always complement each other strongly; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that dynamically selects cross-attended or unattended features on the fly, depending on whether the audio and visual modalities exhibit strong or weak complementary relationships, respectively. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose cross-attended features only when they exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves performance on multiple variants of cross-attention while outperforming state-of-the-art methods.
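The gating idea described above can be illustrated with a minimal NumPy sketch. It is not the implementation evaluated in the paper; it only shows one plausible way a gate could choose, per time step, between unattended and cross-attended features. The function names (`cross_attend`, `gated_select`), the gate weights `w_gate`, the low softmax temperature, and all feature shapes are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    z = x / temperature
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_feats, context_feats):
    """Scaled dot-product cross-attention: each query step attends over the context."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (T_query, T_context)
    weights = softmax(scores, axis=-1)
    return weights @ context_feats  # (T_query, d)


def gated_select(unattended, attended, w_gate, temperature=0.1):
    """Soft per-time-step choice between unattended and cross-attended features.

    Assumption: the gate is a linear layer on the attended features followed by
    a low-temperature softmax, so each row of `gate` is close to one-hot.
    """
    logits = attended @ w_gate  # (T, 2)
    gate = softmax(logits, axis=-1, temperature=temperature)
    mixed = gate[:, :1] * unattended + gate[:, 1:] * attended
    return mixed, gate


# Hypothetical inputs: 6 time steps of 8-dimensional audio and visual features,
# assumed to be already projected to a common dimension.
rng = np.random.default_rng(0)
audio = rng.standard_normal((6, 8))
visual = rng.standard_normal((6, 8))
w_gate = rng.standard_normal((8, 2))  # would be learned in practice

audio_att = cross_attend(audio, visual)  # audio features attended by visual
audio_out, gate = gated_select(audio, audio_att, w_gate)
```

In this sketch, a gate row near `[1, 0]` keeps the unattended audio features for that time step (weak complementarity), while a row near `[0, 1]` keeps the cross-attended ones (strong complementarity). The same selection would be applied symmetrically to the visual branch.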