Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues

In this work, we leverage facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE). The facial region, which encompasses the lip region, reflects additional speech-related attributes such as gender, skin color, and nationality, and these attributes contribute to the effectiveness of AVSE. However, static and dynamic speech-unrelated attributes are also present and cause appearance changes during speech. To address these challenges, we propose a Dual Attention Cooperative Framework, DualAVSE, that ignores speech-unrelated information, captures speech-related information from facial cues, and dynamically integrates it with the audio signal for AVSE. Specifically, we introduce a spatial attention-based visual encoder that captures and enhances visual speech information beyond the lip region; it incorporates global facial context and automatically ignores speech-unrelated information for robust visual feature extraction. In addition, we introduce a dynamic visual feature fusion strategy that integrates a temporal-dimensional self-attention module, enabling the model to handle facial variations robustly. Because acoustic noise varies during speech and degrades audio quality, we also introduce a dynamic fusion strategy for audio and visual features. By integrating cooperative dual attention in the visual encoder and the audio-visual fusion strategy, our model effectively extracts beneficial speech information from both audio and visual cues for AVSE. Thorough analysis and comparison on different datasets, covering both normal cases and challenging cases with unreliable or absent visual information, consistently show that our model outperforms existing methods across multiple metrics.

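The abstract names three mechanisms without giving equations: spatial attention over the face, temporal self-attention over frames, and dynamic audio-visual fusion. The toy sketch below illustrates one generic form of each so the data flow is easier to picture. It is not the DualAVSE implementation. Every function name, the `query` vector, the identity projections in the self-attention step, and the scalar sigmoid gate are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]

def spatial_attention_pool(regions, query):
    """Pool per-region face features into one frame-level vector.

    regions: one feature vector per spatial location of the face.
    query:   hypothetical learned vector that scores speech relevance.
    """
    weights = softmax([dot(r, query) for r in regions])
    return weighted_sum(weights, regions), weights

def temporal_self_attention(frames):
    """Scaled dot-product self-attention across frames.

    Assumption: identity query/key/value projections, to keep the toy small.
    """
    scale = math.sqrt(len(frames[0]))
    out = []
    for q in frames:
        weights = softmax([dot(q, k) / scale for k in frames])
        out.append(weighted_sum(weights, frames))
    return out

def gated_fusion(audio, visual, gate_weights, gate_bias=0.0):
    """Blend audio and visual vectors with a scalar gate in (0, 1).

    Assumption: the gate is a sigmoid of a linear map of the concatenated
    features; the abstract does not specify the fusion rule.
    """
    z = dot(gate_weights, audio + visual) + gate_bias  # list concatenation
    g = 1.0 / (1.0 + math.exp(-z))
    fused = [g * a + (1.0 - g) * v for a, v in zip(audio, visual)]
    return fused, g

# Toy input: two frames, three face regions per frame, 2-dim features.
video = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]],
]
query = [2.0, 0.0]                   # hypothetical relevance query
audio = [[0.2, 0.8], [0.3, 0.7]]     # hypothetical per-frame audio features
gate_w = [0.5, 0.5, -0.5, -0.5]      # hypothetical gate parameters

pooled = [spatial_attention_pool(regions, query)[0] for regions in video]
visual = temporal_self_attention(pooled)
fused = [gated_fusion(a, v, gate_w)[0] for a, v in zip(audio, visual)]
```

In this sketch the softmax weights let regions that score low against the query contribute little to the pooled feature, and the gate shifts each fused frame toward whichever modality the gate parameters favor. A trained model would learn these parameters from data rather than fix them by hand.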