TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

From arXiv; in IEEE/ACM Transactions on Audio, Speech, and Language Processing. A sound demo is available at https://zqwang7.github.io/demos/TF-GridNet-demo/index.html, and the code is available at https://github.com/espnet/espnet/pull/5395

We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation or dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a novel multi-frame Wiener filter computed from the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge.
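
To make the input representation of complex spectral mapping concrete, here is a minimal sketch of how the RI components of a spectrogram can be stacked into real-valued features and recovered again. It uses only the Python standard library. The function names (`stft`, `stack_ri`, `unstack_ri`), the frame length and hop size, and the channel-first `2 x T x F` layout are all assumptions made for illustration; the abstract does not specify the STFT configuration or the tensor layout used by the authors.

```python
import cmath
import math


def stft(signal, frame_len=16, hop=8):
    """Naive Hann-windowed STFT (hypothetical helper, not the authors' code).

    Returns a list of T frames, each a list of F = frame_len // 2 + 1
    complex bins.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ])
    return frames


def stack_ri(spec):
    """Stack real and imaginary parts of a T x F spectrogram into 2 x T x F.

    The channel-first layout (real plane, then imaginary plane) is an
    assumption for this sketch.
    """
    real = [[b.real for b in frame] for frame in spec]
    imag = [[b.imag for b in frame] for frame in spec]
    return [real, imag]


def unstack_ri(feat):
    """Inverse of stack_ri: rebuild the complex T x F spectrogram."""
    real, imag = feat
    return [[complex(r, i) for r, i in zip(rr, ii)]
            for rr, ii in zip(real, imag)]


# Toy mixture of two sinusoids standing in for a two-speaker recording.
mixture = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(64)]
spec = stft(mixture)
features = stack_ri(spec)
print(len(features), len(features[0]), len(features[0][0]))
```

In a complex spectral mapping setup, a DNN would take features like these as input and predict a tensor of the same layout for each target signal; `unstack_ri` then turns the prediction back into a complex spectrogram for inverse STFT.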

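
The headline result in the abstract is reported as an improvement in SI-SDR, so a small sketch of how that metric is commonly computed may help. The version below follows the usual definition (project the estimate onto the target, then compare the energy of the scaled target with the energy of the residual). The function names, the mean removal step, and the small `eps` stabilizer are assumptions of this sketch, not details taken from the paper or from any evaluation toolkit.

```python
import math


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length 1-D signals.

    Mean removal and the eps stabilizer are conventions assumed here.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Optimal scaling of the target that best explains the estimate.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled = [alpha * b for b in t]
    signal_energy = sum(s * s for s in scaled)
    residual_energy = sum((a - s) ** 2 for a, s in zip(e, scaled))
    return 10 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Toy example: two orthogonal sinusoids of equal energy play the two "speakers".
N = 200
target = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 11 * n / N) for n in range(N)]
mixture = [a + b for a, b in zip(target, interferer)]
# A hypothetical separator that attenuates the interferer by a factor of 10.
estimate = [a + 0.1 * b for a, b in zip(target, interferer)]

print(round(si_sdr_improvement(estimate, mixture, target), 2))  # about 20 dB
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by any nonzero constant leaves the score unchanged, which is the property the "scale-invariant" in the name refers to.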