Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which operates on the complex time-frequency bins produced by the Short-Time Fourier Transform. We model the time and frequency dimensions of the audio independently, using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for clearer separation. RTFS-Net outperforms the previous SOTA method while using only 10% of its parameters and 18% of its MACs. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
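To make the notion of complex time-frequency bins concrete, the following is a minimal illustrative sketch, not the RTFS-Net implementation. It computes a naive Short-Time Fourier Transform in pure Python and then reads the result along each axis separately, which is the layout a per-dimension sequence model would consume. The function name `stft`, the Hann window, and the toy settings `n_fft=8`, `hop=4` are assumptions chosen for illustration only.

```python
import cmath
import math


def stft(signal, n_fft=8, hop=4):
    """Naive STFT (illustrative): returns a list of frames,
    each frame being a list of n_fft // 2 + 1 complex bins."""
    # Periodic Hann window (an assumed choice for this sketch).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    n_bins = n_fft // 2 + 1
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        # Window the current segment of the signal.
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        # Direct DFT of the windowed segment, non-negative frequencies only.
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_bins)
        ])
    return frames


# Toy input: a cosine whose frequency falls exactly on bin 2.
n_fft = 8
signal = [math.cos(2 * math.pi * 2 * n / n_fft) for n in range(32)]

spec = stft(signal, n_fft=n_fft, hop=4)  # indexed as spec[time][frequency]
num_frames = len(spec)
num_bins = len(spec[0])

# Two views of the same data, one per axis:
freq_sequence = spec[0]                       # all bins of one frame
time_sequence = [frame[2] for frame in spec]  # one bin across all frames

# The strongest bin in each frame.
peak_bins = [max(range(num_bins), key=lambda k: abs(frame[k])) for frame in spec]
```

Each entry of `spec` is a complex number, so both magnitude and phase are available to later stages. In practice an FFT-based routine from a numerical library would replace the direct DFT used here.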