Speech separation is essential in real-world applications such as human-machine interaction, hearing aid devices, and automatic meeting transcription. In recent years, deep learning has brought significant progress on this problem. Much attention has gone to supervised learning methods trained on synthetic mixture datasets, even though such datasets are not representative of real-world mixtures. The difficulty of building a realistic dataset has led researchers to turn to unsupervised learning methods, which can handle realistic mixtures directly; however, the results of these methods are still unconvincing. In this paper, we introduce a method for creating a realistic dataset with ground-truth sources for speech separation. The main challenge in designing such a dataset is that ground-truth signals for the individual speakers are unavailable. To address this, we propose a method for recording two speakers simultaneously while obtaining the ground truth for each. We also present a methodology for benchmarking our realistic dataset using a deep learning model based on Bidirectional Gated Recurrent Units (BGRU) and a clustering algorithm. Experiments show that our proposed dataset improves SI-SDR (Scale-Invariant Signal-to-Distortion Ratio) by 1.65 dB and PESQ (Perceptual Evaluation of Speech Quality) by approximately 0.5. We further evaluated the effectiveness of our method at different distances between the microphone and the speakers and found that it improved the stability of the learned model.
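For readers unfamiliar with the SI-SDR metric reported above, the following is a minimal sketch of its standard definition: the estimate is projected onto the reference to obtain a scaled target, and the ratio of target energy to residual energy is expressed in dB. The function name `si_sdr` and the plain-list signal representation are illustrative assumptions, not the evaluation code used in this work; practical implementations typically also remove the mean of each signal first.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch).

    Both arguments are equal-length sequences of float samples.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Optimal scaling factor: projection of the estimate onto the reference.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference must not be all zeros")
    alpha = dot / ref_energy

    # Scaled target and the residual left after removing it.
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)

    if residual_energy == 0.0:
        return math.inf       # estimate is an exact scaled copy of the reference
    if target_energy == 0.0:
        return -math.inf      # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / residual_energy)


# Toy signals (hypothetical values chosen so the energies are easy to check).
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 1.0, -1.0, 1.0]
score_db = si_sdr(estimate, reference)
```

Because the target is rescaled before the ratio is taken, multiplying the estimate by any non-zero gain leaves the score unchanged, which is the reason this metric is preferred over plain SDR when the separation model does not preserve signal level.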