This paper studies audio-visual noise suppression for egocentric videos, where the speaker is not captured in the video. Instead, potential noise sources are visible on screen, with the camera emulating the off-screen speaker's view of the outside world. This setting differs from prior work in audio-visual speech enhancement, which relies on lip and facial visuals. We first demonstrate that egocentric visual information is helpful for noise suppression. We compare visual feature extractors based on object recognition and action classification, and investigate methods to align audio and visual representations. We then examine different fusion strategies for the aligned features, as well as the locations within the noise suppression model at which visual information is incorporated. Experiments demonstrate that visual features are most helpful when used to generate additive correction masks. Finally, to ensure that the visual features are discriminative with respect to different noise types, we introduce a multi-task learning framework that jointly optimizes audio-visual noise suppression and video-based acoustic event detection. The proposed multi-task framework outperforms the audio-only baseline on all metrics, including a 0.16 PESQ improvement. Extensive ablations show that the proposed model improves performance with multiple active distractors, over all noise types, and across different SNRs.
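To make the two central ideas concrete, the following is a minimal sketch of an additive correction mask and a joint multi-task objective. The abstract does not specify how the correction is computed or how the two losses are combined, so everything here is an assumption made for illustration: the tanh-bounded linear projection, the mean-squared-error suppression loss, the multi-label binary cross-entropy for event detection, the `weight` parameter, and all function and argument names are hypothetical and should not be read as the authors' implementation.

```python
import numpy as np


def additive_mask_correction(audio_mask, visual_feats, proj):
    """Add a visually derived correction to an audio-only suppression mask.

    audio_mask:   (T, F) mask in [0, 1] from a hypothetical audio-only model
    visual_feats: (T, D) visual embeddings, assumed already aligned to audio frames
    proj:         (D, F) hypothetical learned projection onto frequency bins
    """
    # Assumption: the correction is a bounded linear function of the visual features.
    correction = np.tanh(visual_feats @ proj)
    # Keep the corrected mask a valid gain in [0, 1].
    return np.clip(audio_mask + correction, 0.0, 1.0)


def multitask_loss(enhanced, clean, event_logits, event_labels, weight=0.5):
    """Suppression loss plus a weighted multi-label event-detection loss.

    Assumption: MSE for suppression and binary cross-entropy for event detection,
    combined with a scalar weight.
    """
    suppression = np.mean((enhanced - clean) ** 2)
    probs = 1.0 / (1.0 + np.exp(-event_logits))
    eps = 1e-7  # avoids log(0)
    bce = -np.mean(
        event_labels * np.log(probs + eps)
        + (1.0 - event_labels) * np.log(1.0 - probs + eps)
    )
    return suppression + weight * bce


# Toy usage with random data: 10 frames, 8 frequency bins, 4-dim visual features.
rng = np.random.default_rng(0)
noisy_mag = rng.uniform(size=(10, 8))
audio_mask = rng.uniform(size=(10, 8))
visual_feats = rng.normal(size=(10, 4))
proj = 0.1 * rng.normal(size=(4, 8))

mask = additive_mask_correction(audio_mask, visual_feats, proj)
enhanced_mag = mask * noisy_mag  # apply the corrected mask to the noisy magnitudes
```

In this sketch the audio-only mask stays the primary estimate and the visual branch only nudges it, which is one plausible reading of "additive correction"; other formulations (for example, correcting in the log domain) would be equally consistent with the abstract.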