This paper addresses the challenge of speaker separation, which remains an active research topic despite the promising results achieved in recent years. These results, however, often degrade under real recording conditions due to noise, echo, and other interferences. This is because neural models are typically trained on synthetic datasets consisting of mixed audio signals and their corresponding ground truths, which are generated with computer software and do not fully represent the complexities of real-world recording scenarios. The lack of realistic training sets for speaker separation remains a major hurdle, as obtaining the individual sources from a mixed audio signal is a nontrivial task. To address this issue, we propose a novel method for constructing a realistic training set that includes mixture signals and corresponding ground truths for each speaker. We evaluate this dataset on a deep learning model and compare it to a synthetic dataset. On realistic mixtures, training on the proposed dataset yields a 1.65 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) for speaker separation. Our findings highlight the potential of realistic training sets for enhancing the performance of speaker separation models in real-world scenarios.
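For reference, SI-SDR measures separation quality after projecting the estimate onto the reference signal, which makes it invariant to rescaling of the estimate. The sketch below is a minimal, self-contained implementation of the standard SI-SDR definition (in dB); the signals used in the usage comment are illustrative placeholders, not data from this work.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    # The residual after removing the scaled target counts as distortion.
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10 * math.log10(target_energy / noise_energy)

# Illustrative usage with toy signals:
# ref = [1.0, 2.0, -1.0, 0.5]        # clean source
# est = [1.1, 1.9, -1.0, 0.6]        # model output
# si_sdr(est, ref) is unchanged if est is multiplied by any nonzero constant.
```

Because of the projection step, an "improvement of 1.65 dB in SI-SDR" reflects genuinely less distortion relative to the target source, not merely a better-matched output gain.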