Personalized speech enhancement (PSE) has shown convincing results in extracting a known target voice from interfering ones. PSE systems usually incorporate a representation of the target voice, which upstream models extract from an enrollment clip of that voice. These upstream models are generally heavy, because the quality of the speaker embedding directly affects PSE performance. Yet embeddings generated beforehand cannot account for variations of the target voice at inference time. In this paper, we propose on-the-fly refinement of the speaker embedding using a tiny speaker encoder. We first introduce a novel contrastive knowledge distillation methodology to train a 150k-parameter encoder from complex embeddings. We then use this encoder within the enhancement system during inference and show that the proposed method greatly improves PSE performance while maintaining a low computational load.
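As a rough illustration of what a contrastive knowledge distillation objective can look like, the sketch below computes an InfoNCE-style loss between a batch of student embeddings and the matching teacher embeddings. The abstract does not specify the loss, so the InfoNCE form, the function names `l2_normalize` and `contrastive_distillation_loss`, the `temperature` value, and the assumption that student and teacher embeddings share the same dimension are all illustrative choices, not the authors' method.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def contrastive_distillation_loss(student, teacher, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Row i of `student` is pulled toward row i of `teacher` (same utterance)
    and pushed away from every other teacher row in the batch.
    Both inputs have shape (batch, dim) with the same dim.
    """
    s = l2_normalize(np.asarray(student, dtype=np.float64))
    t = l2_normalize(np.asarray(teacher, dtype=np.float64))
    # Pairwise cosine similarities, scaled by the temperature.
    logits = (s @ t.T) / temperature
    # Subtract the row-wise max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

In such a setup, `teacher` would hold embeddings from the heavy upstream model and `student` the outputs of the tiny encoder for the same utterances; the loss decreases as each student embedding becomes more similar to its own teacher embedding than to the others in the batch.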