Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy for adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while keeping computational overhead minimal.