Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic modeling of acoustic features often requires larger and more computationally intensive models to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which operates on the complex time-frequency bins produced by the Short-Time Fourier Transform (STFT). We model the time and frequency dimensions of the audio independently, using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the multiply-accumulate operations (MACs). This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
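To make the idea of modeling the time and frequency dimensions independently more concrete, the following is a minimal sketch in plain Python. It is not the RTFS-Net implementation: the function names `stft`, `recurrent_scan`, and `dual_path` are hypothetical, a fixed exponential recurrence stands in for the learned multi-layered RNN, and the choice to scan frequency first and time second is an assumption made only for illustration.

```python
import cmath
import math


def stft(signal, win_len=8, hop=4):
    """Return complex time-frequency bins as a list of frames, each a list of bins."""
    # Periodic Hann window (illustrative choice, not taken from the paper).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win_len)]
        # One-sided DFT: keep bins 0 .. win_len // 2.
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len) for n in range(win_len))
            for k in range(win_len // 2 + 1)
        ])
    return frames


def recurrent_scan(seq, decay=0.5):
    """Toy recurrence h[i] = decay * h[i-1] + (1 - decay) * x[i].

    This is only a stand-in for a learned RNN; it has no trainable weights.
    """
    h, out = 0.0, []
    for x in seq:
        h = decay * h + (1 - decay) * x
        out.append(h)
    return out


def dual_path(features):
    """Scan along frequency within each frame, then along time within each bin."""
    # Frequency pass: every time frame is an independent sequence over bins.
    freq_pass = [recurrent_scan(frame) for frame in features]
    n_frames, n_bins = len(freq_pass), len(freq_pass[0])
    # Time pass: every frequency bin is an independent sequence over frames.
    columns = [
        recurrent_scan([freq_pass[t][f] for t in range(n_frames)])
        for f in range(n_bins)
    ]
    # Transpose back to frames x bins.
    return [[columns[f][t] for f in range(n_bins)] for t in range(n_frames)]


# A sinusoid with 2 cycles per 8 samples, 32 samples long.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft(signal)
out = dual_path(spec)
print(len(out), len(out[0]))  # prints: 7 5 (frames x frequency bins)
```

The output keeps the frames-by-bins shape of the input, so in this sketch each axis is processed by its own pass without flattening the spectrogram into a single sequence.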