UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures

In reverberant conditions with multiple concurrent speakers, each microphone, placed at a different location, captures a different mixture of the speakers' signals. In over-determined conditions, where the microphones outnumber the speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture at that microphone). Equipped with this insight, we propose UNSSOR, an algorithm for $\textbf{u}$nsupervised $\textbf{n}$eural $\textbf{s}$peech $\textbf{s}$eparation by leveraging $\textbf{o}$ver-determined training mixtu$\textbf{r}$es. At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered estimates of all the speakers add up to the mixture, satisfying the above constraint. We show that this loss can promote unsupervised separation of speakers. The linear filters are computed in each sub-band from the mixture and the DNN estimates using the forward convolutive prediction (FCP) algorithm. Because sub-band FCP computes the filters independently in each frequency, it incurs a frequency permutation problem; to address it, we propose a loss term that minimizes intra-source magnitude scattering. Although UNSSOR requires over-determined training mixtures, we can train DNNs to achieve under-determined separation (e.g., unsupervised monaural speech separation). Evaluation results on two-speaker separation in reverberant conditions show the effectiveness and potential of UNSSOR.
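The mixture constraint described above can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the function names `one_tap_filter` and `mixture_constraint_loss` are hypothetical, the multi-tap FCP filter is replaced by a single complex gain per sub-band fitted by least squares, the loss is a plain mean absolute residual, and the DNN is left out, so the "estimates" are simply passed in as nested lists of complex STFT coefficients.

```python
def one_tap_filter(y, z, eps=1e-8):
    """Least-squares complex gain g minimizing sum_t |y[t] - g * z[t]|^2.

    Assumption: a one-tap stand-in for the multi-tap FCP filter.
    """
    num = sum(zt.conjugate() * yt for zt, yt in zip(z, y))
    den = sum(abs(zt) ** 2 for zt in z) + eps
    return num / den


def mixture_constraint_loss(mix, est):
    """Mean absolute residual between each mixture and the summed filtered estimates.

    mix[p][f][t]: complex STFT coefficient of the mixture at microphone p.
    est[c][f][t]: complex intermediate estimate for speaker c.
    """
    total, count = 0.0, 0
    for mix_p in mix:                  # each microphone contributes one constraint
        for f, y in enumerate(mix_p):  # filters are computed per sub-band
            gains = [one_tap_filter(y, est_c[f]) for est_c in est]
            for t, yt in enumerate(y):
                # Filtered estimates of all speakers should add up to the mixture.
                recon = sum(g * est_c[f][t] for g, est_c in zip(gains, est))
                total += abs(yt - recon)
                count += 1
    return total / count
```

In this simplified setting, estimates that match the true sources up to a per-microphone complex gain drive the loss toward zero at every microphone, whereas estimates unrelated to the sources leave most of each mixture unexplained. In training, the same quantity would be computed on the DNN outputs with a differentiable framework.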
