UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures

In reverberant conditions with multiple concurrent speakers, each microphone, placed at a different location, acquires a different mixture of the speakers. In over-determined conditions where the microphones outnumber the speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture at that microphone). Equipped with this insight, we propose UNSSOR, an algorithm for $\textbf{u}$nsupervised $\textbf{n}$eural $\textbf{s}$peech $\textbf{s}$eparation by leveraging $\textbf{o}$ver-determined training mixtu$\textbf{r}$es. At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered estimates of all the speakers add up to the mixture, satisfying the above constraint. We show that this loss can promote unsupervised separation of speakers. The linear filters are computed in each sub-band from the mixture and the DNN estimates through the forward convolutive prediction (FCP) algorithm. To address the frequency permutation problem incurred by using sub-band FCP, we propose a loss term based on minimizing intra-source magnitude scattering. Although UNSSOR requires over-determined training mixtures, we can train DNNs to achieve under-determined separation (e.g., unsupervised monaural speech separation). Evaluation results on two-speaker separation in reverberant conditions show the effectiveness and potential of UNSSOR.
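The mixture-constraint idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made for illustration: the function names `fcp_filter` and `mixture_constraint_loss`, the `taps` parameter, the use of plain (unweighted) least squares over current and past frames as a stand-in for the FCP filter estimation, and an absolute-error loss on the complex STFT residual. The DNN and the intra-source magnitude scattering term are omitted; `est` stands for the intermediate estimates the DNN would produce.

```python
import numpy as np


def fcp_filter(mix_f, est_f, taps=3):
    """Filter one speaker estimate toward one microphone mixture in one sub-band.

    mix_f: (T,) complex STFT of the mixture at one microphone and frequency.
    est_f: (T,) complex intermediate estimate of one speaker at that frequency.
    Returns the linearly filtered estimate, shape (T,).
    Assumption: unweighted least squares stands in for the FCP estimation.
    """
    T = est_f.shape[0]
    # Column k holds the estimate delayed by k frames (zero-padded at the start).
    stacked = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        stacked[k:, k] = est_f[:T - k]
    # Per-sub-band filter that best predicts the mixture from the estimate.
    g, *_ = np.linalg.lstsq(stacked, mix_f, rcond=None)
    return stacked @ g


def mixture_constraint_loss(mix, est, taps=3):
    """Mean absolute residual between each mixture and the summed filtered estimates.

    mix: (P, T, F) complex STFTs of P microphone mixtures.
    est: (C, T, F) complex intermediate estimates of C speakers.
    """
    P, T, F = mix.shape
    C = est.shape[0]
    total = 0.0
    for p in range(P):
        recon = np.zeros((T, F), dtype=complex)
        for c in range(C):
            for f in range(F):
                # Filters are computed independently per speaker, microphone, sub-band.
                recon[:, f] += fcp_filter(mix[p, :, f], est[c, :, f], taps)
        # At each microphone, the filtered estimates should add up to the mixture.
        total += np.abs(mix[p] - recon).sum()
    return float(total / (P * T * F))
```

In this sketch the loss is small when the estimates are close to the true sources and large when they are unrelated to the mixtures, which is the behavior the constraint relies on. Because the filters are solved independently in each sub-band, nothing ties a speaker's index at one frequency to the same index at another, which is where the frequency permutation problem mentioned above comes from.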
