We present an upper bound for the Single Channel Speech Separation task, based on an assumption about the nature of short segments of speech. Using the bound, we show that while recent methods have made significant progress for a few speakers, there is room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the estimates of the individual speakers. At test time, SepIt uses a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
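The per-sample iteration count described above can be illustrated with a minimal sketch. It is not SepIt itself: `refine` is a hypothetical stand-in for one pass of the network, `score` is a hypothetical stand-in for the mutual information criterion (whose exact form is not given here), and `max_iters` and `min_gain` are assumed parameters chosen only for illustration.

```python
from typing import Callable, List, Tuple


def refine_until_plateau(
    estimate: List[float],
    refine: Callable[[List[float]], List[float]],
    score: Callable[[List[float]], float],
    max_iters: int = 10,
    min_gain: float = 1e-3,
) -> Tuple[List[float], int]:
    """Refine an estimate until the score gain drops below `min_gain`.

    `refine` and `score` are placeholders (assumptions): the first stands in
    for one refinement pass, the second for a stopping criterion.
    """
    best_score = score(estimate)
    iters_used = 0
    for _ in range(max_iters):
        candidate = refine(estimate)
        candidate_score = score(candidate)
        # Stop as soon as another pass no longer improves the score enough.
        if candidate_score - best_score < min_gain:
            break
        estimate, best_score = candidate, candidate_score
        iters_used += 1
    return estimate, iters_used


# Toy stand-ins: each pass moves the estimate halfway to a fixed target,
# and the score is the negative squared error to that target.
TARGET = [1.0, -2.0]


def toy_refine(est: List[float]) -> List[float]:
    return [e + 0.5 * (t - e) for e, t in zip(est, TARGET)]


def toy_score(est: List[float]) -> float:
    return -sum((e - t) ** 2 for e, t in zip(est, TARGET))


far_result, far_iters = refine_until_plateau([0.0, 0.0], toy_refine, toy_score)
near_result, near_iters = refine_until_plateau(list(TARGET), toy_refine, toy_score)
print(far_iters, near_iters)
```

In this toy setup, a sample that starts far from the target uses more passes than one that starts at the target, which is the behaviour a per-sample stopping rule is meant to produce.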