The integration of different modalities, such as audio and visual information, plays a crucial role in how humans perceive their surroundings. Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, existing methods predominantly place multi-modal fusion at either the top or the bottom of the network, rather than considering fusion comprehensively at multiple hierarchical positions. In this paper, we propose a novel model, the self- and cross-attention network (SCANet), which leverages the attention mechanism for efficient audio-visual feature fusion. SCANet consists of two types of attention blocks: self-attention (SA) blocks and cross-attention (CA) blocks, with the CA blocks distributed at the top (TCA), middle (MCA), and bottom (BCA) of the network. These blocks preserve the ability to learn modality-specific features while enabling the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of SCANet, which outperforms existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
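To make the idea of self-attention plus cross-attention at several depths concrete, the following is a minimal NumPy sketch, not the SCANet implementation. The abstract does not specify the block wiring, so everything here is an assumption for illustration: the names `Attention` and `toy_hierarchical_fusion`, the single-head scaled dot-product form, the residual connections, the random untrained weights, and the choice to run one SA block per modality followed by a bidirectional CA exchange at each of three stages (standing in for BCA, MCA, and TCA).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


class Attention:
    """Hypothetical single-head scaled dot-product attention with a residual.

    Calling it with context == x gives self-attention (SA); calling it with
    the other modality as context gives cross-attention (CA).
    """

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        # Random, untrained projection weights (illustration only).
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        self.dim = dim

    def __call__(self, x, context):
        q = x @ self.wq            # (T_x, dim): queries from the target stream
        k = context @ self.wk      # (T_c, dim): keys from the context stream
        v = context @ self.wv      # (T_c, dim): values from the context stream
        weights = softmax(q @ k.T / np.sqrt(self.dim))  # (T_x, T_c)
        return x + weights @ v     # residual keeps modality-specific features


def toy_hierarchical_fusion(audio, video, dim, seed=0):
    """Assumed wiring: SA per modality, then bidirectional CA, at three depths."""
    rng = np.random.default_rng(seed)
    for _stage in ("bottom", "middle", "top"):   # stand-ins for BCA, MCA, TCA
        sa_audio, sa_video = Attention(dim, rng), Attention(dim, rng)
        ca_audio, ca_video = Attention(dim, rng), Attention(dim, rng)
        audio = sa_audio(audio, audio)           # self-attention on audio
        video = sa_video(video, video)           # self-attention on video
        # Cross-attention: each stream queries the other; both use the
        # pre-exchange features because the tuple is evaluated first.
        audio, video = ca_audio(audio, video), ca_video(video, audio)
    return audio, video


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    audio_feats = rng.normal(size=(50, 16))  # 50 audio frames, 16-dim features
    video_feats = rng.normal(size=(10, 16))  # 10 video frames, 16-dim features
    fused_audio, fused_video = toy_hierarchical_fusion(audio_feats, video_feats, dim=16)
    print(fused_audio.shape, fused_video.shape)
```

Because the attention weights have shape `(T_x, T_c)`, the two streams may have different frame rates: each output keeps the sequence length of its own modality while mixing in information from the other.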