We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations (SDEs). DiffSpEx defines a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain that starts from the target speaker source and converges to a Gaussian distribution centred on the mixture of sources. For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources. We use ECAPA-TDNN target speaker embeddings and condition the score function alternately on the SDE time embedding and the target speaker embedding. On the WSJ0-2mix dataset, DiffSpEx achieves an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we show that fine-tuning a pre-trained DiffSpEx model to a specific speaker further improves performance, enabling personalisation in target speaker extraction.
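To make the forward process concrete, the following is a minimal sketch of a diffusion whose mean drifts from the target source towards the mixture while the noise level grows. The abstract does not give the SDE, so everything specific here is an assumption: the Ornstein-Uhlenbeck-style drift `gamma * (y - x)`, the geometric diffusion coefficient, the parameter values `gamma`, `sigma_min` and `sigma_max`, and all function names are illustrative only and may differ from what DiffSpEx actually uses.

```python
import math
import random

# Assumed forward SDE (not stated in the abstract):
#   dx = gamma * (y - x) dt + g(t) dw,
#   g(t) = sigma_min * (sigma_max / sigma_min)**t * sqrt(2 * ln(sigma_max / sigma_min))
# where x0 is the target speaker STFT and y is the mixture STFT.


def perturbation_mean(x0, y, t, gamma=1.5):
    """Mean of x_t: equals x0 at t = 0 and decays exponentially towards y."""
    w = math.exp(-gamma * t)
    return [w * a + (1.0 - w) * b for a, b in zip(x0, y)]


def perturbation_std(t, gamma=1.5, sigma_min=0.05, sigma_max=0.5):
    """Standard deviation of x_t implied by the assumed SDE (zero at t = 0)."""
    log_r = math.log(sigma_max / sigma_min)
    growth = (sigma_max / sigma_min) ** (2.0 * t) - math.exp(-2.0 * gamma * t)
    var = sigma_min ** 2 * log_r * growth / (gamma + log_r)
    return math.sqrt(max(var, 0.0))


def sample_xt(x0, y, t, rng, gamma=1.5):
    """Draw x_t as mean plus complex Gaussian noise scaled by the std."""
    mean = perturbation_mean(x0, y, t, gamma)
    std = perturbation_std(t, gamma)
    # Real and imaginary parts are drawn independently (an assumption).
    return [
        m + std * complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) / math.sqrt(2.0)
        for m in mean
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    target = [1 + 1j, 0.5 - 0.2j]   # toy STFT bins of the target speaker
    mixture = [3 - 1j, 1.0 + 0.4j]  # toy STFT bins of the mixture
    for t in (0.0, 0.5, 1.0):
        print(t, perturbation_std(t), sample_xt(target, mixture, t, rng))
```

Under these assumptions, a sample at `t = 0` is exactly the target source, and as `t` grows the mean moves towards the mixture while the standard deviation increases, which mirrors the behaviour the abstract describes for the forward process.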