Real-time single-channel speech separation aims to unmix an audio stream captured by a single microphone, which contains multiple people talking at once, environmental noise, and reverberation, into multiple de-reverberated, noise-free speech tracks, each containing only one talker. While large state-of-the-art DNNs can achieve excellent separation of anechoic speech mixtures, the main challenge is to create compact, causal models that can separate reverberant mixtures at inference time. In this paper, we explore low-complexity, resource-efficient, causal DNN architectures for real-time separation of two or more simultaneous speakers. A cascade of three neural network modules is trained to sequentially perform noise suppression, separation, and de-reverberation. For comparison, a larger end-to-end model is trained to output two anechoic speech signals directly from noisy reverberant speech mixtures. We propose an efficient single-decoder architecture with subtractive separation for real-time recursive speech separation of two or more speakers. Evaluation on real monophonic recordings of speech mixtures, using speech separation measures such as SI-SDR, perceptual measures such as DNS-MOS, and a novel channel separation metric, shows that these compact causal models can separate speech mixtures with low latency and perform on par with large offline state-of-the-art models such as SepFormer.
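For readers unfamiliar with SI-SDR, one of the separation measures named above, the following is a minimal NumPy sketch of the metric as it is commonly defined (scale-invariant signal-to-distortion ratio, in dB). The function name `si_sdr`, the zero-mean normalization, and the `eps` stabilizer are assumptions made for illustration; the abstract does not specify the exact evaluation configuration used in the paper.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference.

    Illustrative sketch only: zero-mean normalization and eps are assumptions,
    not details taken from the paper.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove DC offsets so the projection is not biased by a constant shift.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference      # part of the estimate explained by the reference
    residual = estimate - target    # everything else counts as distortion

    # Energy ratio of target to distortion, expressed in decibels.
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )
```

Because the reference is rescaled by the projection coefficient before the ratio is taken, multiplying the estimate by any nonzero gain leaves the score essentially unchanged, so the measure rewards correct separation rather than correct output level.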