Person (identity) verification based on audio-visual fusion has recently attracted considerable attention, as faces and voices are closely associated with each other. Conventional audio-visual approaches rely on score-level or early feature-level fusion. Although these approaches improve over unimodal systems, the potential of audio-visual fusion for person verification has not been fully exploited. In this paper, we investigate how to effectively capture both the intra- and inter-modal relationships across the audio and visual modalities, which can play a crucial role in significantly improving fusion performance over unimodal systems. In particular, we introduce a recursive fusion of a joint cross-attentional model, in which a joint audio-visual feature representation is employed in the cross-attention framework recursively to progressively refine feature representations that efficiently capture the intra- and inter-modal relationships. To further enhance the audio-visual feature representations, we also explore BLSTMs to improve their temporal modeling. Extensive experiments are conducted on the VoxCeleb1 dataset to evaluate the proposed model. The results indicate that the proposed model achieves a promising improvement in fusion performance by adeptly capturing the intra- and inter-modal relationships across the audio and visual modalities.
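The recursive joint cross-attention idea described in the abstract can be illustrated with a minimal NumPy sketch. The abstract gives no equations, so everything here is an assumption made for illustration: the function names (`joint_cross_attention_step`, `recursive_fusion`, `init_params`), the tanh-scaled correlation, the softmax over time steps, the residual update, the sharing of weights across iterations, and the randomly initialized weights. It is not the authors' implementation, and the BLSTM temporal modeling is omitted.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d_a, d_v, seed=0):
    """Random weights for the audio and visual branches (hypothetical shapes)."""
    rng = np.random.default_rng(seed)
    d = d_a + d_v  # dimension of the joint audio-visual representation
    return [
        (rng.normal(scale=0.1, size=(d_m, d)),   # correlation weights
         rng.normal(scale=0.1, size=(d, d_m)))   # output projection
        for d_m in (d_a, d_v)
    ]


def joint_cross_attention_step(x_a, x_v, params):
    """One fusion step: each modality attends to the joint representation.

    x_a has shape (L, d_a) and x_v has shape (L, d_v); rows are time steps.
    """
    joint = np.concatenate([x_a, x_v], axis=1)  # (L, d_a + d_v)
    d = joint.shape[1]
    refined = []
    for x, (w_corr, w_out) in zip((x_a, x_v), params):
        # Correlation of every modality time step with every joint time step.
        corr = np.tanh(x @ w_corr @ joint.T / np.sqrt(d))  # (L, L)
        weights = softmax(corr, axis=1)
        # Project the attended joint context back to the modality dimension
        # and add it to the input as a residual update.
        refined.append(x + weights @ joint @ w_out)
    return refined[0], refined[1]


def recursive_fusion(x_a, x_v, params, num_iterations=3):
    """Feed the refined features back into the same step several times."""
    for _ in range(num_iterations):
        x_a, x_v = joint_cross_attention_step(x_a, x_v, params)
    return x_a, x_v


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    audio = rng.normal(size=(5, 8))   # 5 time steps, 8-dim audio features
    visual = rng.normal(size=(5, 6))  # 5 time steps, 6-dim visual features
    a, v = recursive_fusion(audio, visual, init_params(8, 6))
    print(a.shape, v.shape)
```

Because the joint representation is rebuilt from the refined features at every iteration, each pass sees both modalities' updated features, which is one plausible reading of "progressively refine". In a trained system the weights would be learned, and the refined features would be pooled and passed to a verification back end.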