Audio-visual speech separation methods integrate multiple modalities to generate high-quality separated speech, thereby improving the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic modeling of acoustic features often requires larger, more computationally intensive models to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method, the Recurrent Time-Frequency Separation Network (RTFS-Net), which operates on the complex time-frequency bins produced by the Short-Time Fourier Transform. We model the time and frequency dimensions of the audio independently, using a multi-layered RNN along each dimension. Furthermore, we introduce an attention-based fusion technique for efficiently integrating audio and visual information, and a new mask separation approach that exploits the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the multiply-accumulate operations (MACs). This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
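To make the phrase "complex time-frequency bins" concrete, the following is a minimal sketch, not the authors' implementation. It computes a naive Short-Time Fourier Transform with the Python standard library and then exposes the two views that a per-dimension model would traverse: sequences running along time (one per frequency bin) and sequences running along frequency (one per time frame). The Hann window, the frame length of 8, the hop of 4, and the helper names `stft`, `along_time`, and `along_frequency` are all assumptions chosen for illustration; the RNNs, the fusion step, and the masking step are omitted.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Naive STFT: Hann-windowed frames -> one-sided complex DFT bins.

    Returns a list of frames (time index first); each frame is a list of
    frame_len // 2 + 1 complex numbers (frequency index second).
    Illustrative settings only, not taken from the paper.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def along_time(spec):
    """One sequence per frequency bin, each running over the time frames."""
    return [list(seq) for seq in zip(*spec)]


def along_frequency(spec):
    """One sequence per time frame, each running over the frequency bins."""
    return [list(frame) for frame in spec]


# Toy input: a sinusoid with exactly 2 cycles per 8-sample frame,
# so its energy should concentrate in frequency bin 2 of every frame.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft(signal)

time_sequences = along_time(spec)       # 5 sequences, each of length 7
freq_sequences = along_frequency(spec)  # 7 sequences, each of length 5
```

In this toy layout, a model that treats the two dimensions independently would run one recurrent pass over each entry of `time_sequences` and another over each entry of `freq_sequences`. Each bin is a complex number, so both magnitude and phase are available to the model.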