Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments

Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. In real-world environments, several speakers often talk simultaneously, which underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention, designed for noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed $\text{Sep-TFAnet}^{\text{VAD}}$, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use the short-time Fourier transform (STFT) and the inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is developed specifically for human-robot interaction and should support an online mode. The separation capabilities of $\text{Sep-TFAnet}^{\text{VAD}}$ and Sep-TFAnet were evaluated and extensively analyzed under several acoustic conditions, demonstrating their advantages over competing methods. Since separation networks trained on simulated data tend to perform poorly on real recordings, we also demonstrate that the proposed scheme generalizes better to realistic examples recorded in our acoustic lab by a humanoid robot. Project page: https://Sep-TFAnet.github.io
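
To make the fixed analysis/synthesis stage concrete, the following is a minimal sketch of an STFT → mask → iSTFT pipeline in NumPy. It is not the Sep-TFAnet implementation. The frame length (512), hop (256), square-root Hann window, 16 kHz sampling rate, and the use of one multiplicative mask per speaker are all assumptions made for illustration, and `placeholder_masks` is a hypothetical stand-in for the trained separator.

```python
import numpy as np

def sqrt_hann(frame_len):
    # Periodic Hann window, square-rooted: applied at analysis and synthesis,
    # the squared windows sum to one at 50% overlap (perfect reconstruction).
    n = np.arange(frame_len)
    return np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len))

def stft(x, frame_len=512, hop=256):
    # Analysis: zero-pad, slice into overlapping frames, window, real FFT.
    # This sketch assumes hop == frame_len // 2.
    pad = frame_len - hop
    extra = (-len(x)) % hop  # pad the tail up to a whole number of hops
    x = np.pad(x, (pad, pad + extra))
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * sqrt_hann(frame_len), axis=-1)

def istft(spec, length, frame_len=512, hop=256):
    # Synthesis: inverse FFT, window again, overlap-add, remove the padding.
    frames = np.fft.irfft(spec, n=frame_len, axis=-1) * sqrt_hann(frame_len)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    pad = frame_len - hop
    return out[pad:pad + length]

def placeholder_masks(mag, n_speakers=2):
    # Hypothetical stand-in for the separation network: assigns disjoint
    # frequency bands to each output. A trained model would predict these.
    edges = np.linspace(0, mag.shape[-1], n_speakers + 1).astype(int)
    masks = np.zeros((n_speakers,) + mag.shape)
    for s in range(n_speakers):
        masks[s, :, edges[s]:edges[s + 1]] = 1.0
    return masks

def separate(mixture, frame_len=512, hop=256):
    # Fixed STFT analysis, per-speaker masking, fixed iSTFT synthesis.
    spec = stft(mixture, frame_len, hop)
    masks = placeholder_masks(np.abs(spec))
    return np.stack([istft(m * spec, len(mixture), frame_len, hop) for m in masks])

rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)  # 1 s of noise at an assumed 16 kHz
estimates = separate(mixture)
print(estimates.shape)  # (2, 16000)
```

Because the transform pair is fixed rather than learned, the only trainable part in such a design is whatever produces the masks. With short frames and a causal mask estimator, the same structure can in principle run frame by frame, which is relevant to the online mode mentioned above.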
