Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic modeling of acoustic features often requires larger and more computationally intensive models to reach SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which operates on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model the time and frequency dimensions of the audio independently, using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the prior SOTA method in both inference speed and separation quality while reducing the number of parameters by 90% and the number of MACs by 83%. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
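To make the time-frequency processing described above more concrete, the following is a minimal NumPy sketch, not the RTFS-Net implementation. It computes complex STFT bins, runs a toy recurrent pass along the frequency axis and then along the time axis, and applies a complex-valued mask to the mixture spectrogram. Everything in it is an assumption made for illustration: the helper names `stft` and `simple_rnn`, the window and hop sizes, the random weights, the frequency-then-time ordering, and the use of complex multiplication for masking. The abstract does not specify any of these details.

```python
import numpy as np


def stft(signal, n_fft=256, hop=128):
    """Return complex time-frequency bins of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def simple_rnn(x, w_in, w_rec):
    """Plain tanh RNN over axis 0 of x, which has shape (steps, batch, features)."""
    h = np.zeros((x.shape[1], w_rec.shape[0]))
    outputs = []
    for step in x:
        h = np.tanh(step @ w_in + h @ w_rec)
        outputs.append(h)
    return np.stack(outputs)


rng = np.random.default_rng(0)

# Toy "mixture": two sinusoids standing in for overlapping speakers (assumption).
sr = 16000
t = np.arange(sr) / sr
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spec = stft(mixture)                               # (T, F), complex
feats = np.stack([spec.real, spec.imag], axis=-1)  # (T, F, 2), real-valued

# Randomly initialised weights; a real model would learn these.
hidden = 8
w_in_f = rng.normal(scale=0.1, size=(2, hidden))
w_rec_f = rng.normal(scale=0.1, size=(hidden, hidden))
w_in_t = rng.normal(scale=0.1, size=(hidden, hidden))
w_rec_t = rng.normal(scale=0.1, size=(hidden, hidden))
w_out = rng.normal(scale=0.1, size=(hidden, 2))

# Frequency pass: recurrence steps through F, each time frame is a batch item.
freq_out = simple_rnn(feats.transpose(1, 0, 2), w_in_f, w_rec_f)    # (F, T, hidden)

# Time pass: recurrence steps through T, each frequency bin is a batch item.
time_out = simple_rnn(freq_out.transpose(1, 0, 2), w_in_t, w_rec_t)  # (T, F, hidden)

# Project to a two-channel (real, imaginary) mask and apply it by
# complex multiplication to the mixture spectrogram.
mask_ri = time_out @ w_out                         # (T, F, 2)
mask = mask_ri[..., 0] + 1j * mask_ri[..., 1]      # (T, F), complex
separated = spec * mask                            # (T, F), complex
```

Treating the real and imaginary parts as two feature channels keeps the recurrent layers real-valued, while the final complex multiplication lets the mask adjust both magnitude and phase of each bin. Transposing between the two passes is what lets the same recurrence routine model each axis separately.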