Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, existing methods predominantly fuse auditory and visual features at a single temporal scale and do not employ selective attention mechanisms, in sharp contrast to the brain. To address this issue, we propose a novel model called Intra- and Inter-Attention Network (IIANet), which leverages attention mechanisms for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle, and bottom of IIANet. Inspired by how the human brain selectively focuses on relevant content at various temporal scales, these blocks retain the ability to learn modality-specific features while enabling the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet, which outperforms previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version of IIANet (IIANet-fast) has only 7% of CTCNet's MACs and is 40% faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of attention mechanisms for efficient and effective multimodal fusion.
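To make the idea of attention-based audio-visual fusion at several temporal scales more concrete, the following is a minimal sketch in plain Python. It is not the IntraA or InterA block design, which the abstract does not specify. It assumes generic scaled dot-product cross-attention (audio frames as queries, visual frames as keys and values), average pooling to obtain coarser temporal scales, and a residual sum for fusion. All function names, the stride choices, and the toy inputs are hypothetical.

```python
import math

def avg_pool(frames, stride):
    """Average consecutive groups of `stride` frames (the last group may be shorter)."""
    pooled = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Each query frame attends over all key/value frames (scaled dot product)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                    for d in range(dim)])
    return out

def fuse_multiscale(audio, video, strides=(1, 2, 4)):
    """Fuse audio with video at several temporal scales (assumed strides).

    Returns one fused audio sequence per stride.
    """
    fused = []
    for s in strides:
        a = avg_pool(audio, s)
        v = avg_pool(video, s)
        attended = cross_attention(a, v)
        # Residual sum: keeps the audio-specific features alongside the visual cue.
        fused.append([[x + y for x, y in zip(fa, fv)]
                      for fa, fv in zip(a, attended)])
    return fused

# Toy inputs: 8 audio frames and 4 video frames, both with 2 feature channels.
audio = [[0.1 * t, 1.0 - 0.1 * t] for t in range(8)]
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
fused = fuse_multiscale(audio, video)
print([len(scale) for scale in fused])
```

In this sketch the audio and visual streams may have different frame rates, since attention does not require aligned lengths. Each temporal scale yields a fused sequence whose length matches the pooled audio at that scale.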