SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation

The integration of different modalities, such as audio and visual information, plays a crucial role in human perception of the surrounding environment. Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, existing methods predominantly place multi-modal fusion at either the top or the bottom of the network, rather than considering fusion at multiple hierarchical positions. In this paper, we propose a novel model, the self- and cross-attention network (SCANet), which leverages the attention mechanism for efficient audio-visual feature fusion. SCANet consists of two types of attention blocks: self-attention (SA) blocks and cross-attention (CA) blocks, with the CA blocks distributed at the top (TCA), middle (MCA), and bottom (BCA) of the network. These blocks preserve the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of SCANet, which outperforms existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
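The distinction between self-attention and cross-attention can be illustrated with a minimal NumPy sketch. This is not SCANet's actual block design, which the abstract does not specify: the function names, the feature sizes, the residual connections, and the omission of learned projections and multiple heads are all assumptions made for illustration. In self-attention, a modality attends to itself; in cross-attention, queries come from one modality while keys and values come from the other.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value


def self_attention(x):
    """Hypothetical SA step: a modality attends to itself (residual added)."""
    return x + attention(x, x, x)


def cross_attention(target, source):
    """Hypothetical CA step: `target` queries `source` (residual added)."""
    return target + attention(target, source, source)


# Illustrative feature sequences (sizes are assumptions, not from the paper).
rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 16))  # 50 audio frames, 16-dim features
video = rng.standard_normal((10, 16))  # 10 video frames, 16-dim features

# Modality-specific processing, followed by fusion in both directions.
audio_sa = self_attention(audio)
video_sa = self_attention(video)
audio_fused = cross_attention(audio_sa, video_sa)
video_fused = cross_attention(video_sa, audio_sa)

print(audio_fused.shape, video_fused.shape)
```

Because the output length follows the query sequence, each modality keeps its own temporal resolution after fusion, so a step of this kind could in principle be applied at several depths of a network without resampling either stream.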

