Target speech extraction (TSE) systems are designed to extract target speech from a multi-talker mixture. Most prior TSE networks are trained to improve the reconstruction quality of the extracted speech waveform. However, it has been reported that a TSE system that delivers high reconstruction performance may still suffer from poor listening experience in practice. One such problem is wrong speaker extraction (called speaker confusion, SC), which creates a strongly negative experience and hampers effective conversation. To mitigate this pressing SC issue, we reformulate the training objective and propose two novel loss schemes that exploit a reconstruction-improvement metric defined at the level of small chunks and leverage the distribution information associated with that metric. Both loss schemes encourage a TSE network to pay attention to SC chunks based on this distribution information. On this basis, we present X-SepFormer, an end-to-end TSE model that combines the proposed loss schemes with a SepFormer backbone. Experimental results on the benchmark WSJ0-2mix dataset validate the effectiveness of our proposals, showing a consistent improvement in SC errors (14.8% relative). Moreover, with an SI-SDRi of 19.4 dB and a PESQ of 3.81, our best system significantly outperforms the current SOTA systems and offers the best TSE results reported to date on WSJ0-2mix.
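To make the idea of a chunk-level reconstruction-improvement metric concrete, the following is a minimal NumPy sketch, not the loss schemes proposed here. It assumes SI-SDRi as the chunk-level metric, non-overlapping chunks of a fixed length, and a threshold of 0 dB below which a chunk is treated as a likely SC chunk. The function names (`si_sdr`, `chunk_si_sdri`, `sc_chunk_ratio`), the chunk length, and the threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target part.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def chunk_si_sdri(estimate, mixture, reference, chunk_len):
    """SI-SDR improvement over the mixture for each non-overlapping chunk."""
    n_chunks = len(reference) // chunk_len
    scores = []
    for i in range(n_chunks):
        s = slice(i * chunk_len, (i + 1) * chunk_len)
        scores.append(si_sdr(estimate[s], reference[s]) - si_sdr(mixture[s], reference[s]))
    return np.array(scores)


def sc_chunk_ratio(scores, threshold=0.0):
    """Fraction of chunks whose SI-SDRi falls below the (assumed) threshold."""
    return float(np.mean(scores < threshold))


# Synthetic example: the "extractor" returns the target for the first half
# of the utterance and the interfering speaker for the second half.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = target + interferer
estimate = np.concatenate([target[:8000], interferer[8000:]])

scores = chunk_si_sdri(estimate, mixture, target, chunk_len=2000)
ratio = sc_chunk_ratio(scores)
print(scores.round(1), ratio)
```

In this synthetic case the chunks from the second half have negative SI-SDRi and are flagged, so half of the chunks count as SC chunks. An utterance-level average could hide such chunks when the remaining chunks score highly, which is why a chunk-level view is useful for exposing SC errors.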