A key challenge in machine learning is generalizing from training data to the application domain of interest. This work extends the recently proposed mixture invariant training (MixIT) algorithm to unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping, reverberant, and noisy speech from the AMI Corpus. The models are trained on both supervised and unsupervised training data and are tested on real AMI recordings containing overlapping speech. To evaluate our models objectively, we also use a synthetic multi-channel AMI test set. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest improvement in scale-invariant signal-to-noise ratio (SI-SNR) and in human listening ratings across synthetic and real datasets, outperforming supervised models trained on well-matched synthetic data. Our results demonstrate that unsupervised learning through MixIT enables model adaptation on both single- and multi-channel real-world speech recordings.
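For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal single-channel sketch of SI-SNR and of a MixIT-style loss, not the authors' implementation. It assumes the standard MixIT formulation: a separation model receives the sum of several reference mixtures, and each of its output sources is assigned to exactly one reference mixture so that the remixed outputs best match the references. The function names `si_snr` and `mixit_loss`, the use of negative SI-SNR as the per-mixture loss, and the brute-force search over assignments are all illustrative assumptions. The multi-channel extension described in the abstract is not covered here.

```python
import itertools

import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (illustrative helper)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target so that gain differences are ignored.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def mixit_loss(separated, mixtures):
    """MixIT-style loss by exhaustive search (hypothetical helper).

    separated: array of shape (num_sources, num_samples), the model outputs.
    mixtures:  array of shape (num_mixtures, num_samples), the reference mixtures.
    Returns the lowest total negative SI-SNR and the assignment that achieves it,
    where assignment[m] is the index of the mixture that output m is added to.
    """
    num_sources = separated.shape[0]
    num_mixtures = mixtures.shape[0]
    best_loss, best_assignment = np.inf, None
    # Each output source goes to exactly one reference mixture.
    for assignment in itertools.product(range(num_mixtures), repeat=num_sources):
        remixed = np.zeros_like(mixtures)
        for source_index, mixture_index in enumerate(assignment):
            remixed[mixture_index] += separated[source_index]
        loss = -sum(si_snr(remixed[n], mixtures[n]) for n in range(num_mixtures))
        if loss < best_loss:
            best_loss, best_assignment = loss, assignment
    return best_loss, best_assignment


# Usage with synthetic data: four random "sources" forming two reference mixtures.
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 1000))
references = np.stack([sources[0] + sources[2], sources[1] + sources[3]])
loss, assignment = mixit_loss(sources, references)
```

The exhaustive search visits `num_mixtures ** num_sources` assignments, which is manageable for two mixtures and a handful of outputs. Because the loss needs only the reference mixtures and never the isolated sources, it can be computed on real recordings for which no ground-truth sources exist.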