The paper introduces Diff-Filter, a multichannel speech enhancement approach based on the diffusion probabilistic model, for improving speaker verification performance under noisy and reverberant conditions. It also presents a new two-stage training procedure that takes advantage of self-supervised learning. In the first stage, the Diff-Filter is trained to perform time-domain speech filtering using a score-based diffusion model. In the second stage, the Diff-Filter is jointly optimized with a pre-trained ECAPA-TDNN speaker verification model under a self-supervised learning framework. We present a novel loss based on the equal error rate, which is used to conduct self-supervised learning on a dataset that has no speaker labels. The proposed approach is evaluated on MultiSV, a multichannel speaker verification dataset, and shows significant improvements in performance under noisy multichannel conditions.
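For readers unfamiliar with the metric behind the proposed loss, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how the metric itself can be computed from verification trial scores. It is not the loss proposed in the paper, which is not specified in this abstract; the function name `equal_error_rate` and the toy scores are assumptions made for illustration only.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for same-speaker and different-speaker trial scores.

    Hypothetical helper for illustration; higher scores mean "more likely
    the same speaker". A trial is accepted when its score >= threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    best = None
    # Candidate thresholds: every observed score.
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)

    return best[1], best[2]


# Toy scores (made up for this example, not from MultiSV).
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2f} at threshold {thr}")
```

Because this computation relies on hard threshold comparisons, it is not differentiable as written; a training loss based on EER would need some smooth surrogate, the form of which is not given here.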