Recent studies demonstrate the effectiveness of Self-Supervised Learning (SSL) speech representations for Speech Inversion (SI). However, applying SI in real-world scenarios remains challenging due to the pervasive presence of background noise. We propose a unified framework that integrates Speech Enhancement (SE) and SI models through shared SSL-based speech representations. In this framework, the SSL model is trained not only to support the SE module in suppressing noise but also to produce representations that are more informative for the SI task, allowing both modules to benefit from joint training. At a Signal-to-Noise Ratio of -5 dB, our method for the SI task achieves relative improvements over the baseline of 80.95% under babble noise and 38.98% under non-babble noise, as measured by the average Pearson product-moment correlation across all estimated parameters.
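The evaluation metric named above, the Pearson product-moment correlation averaged over all estimated parameter tracks, can be sketched as follows. This is an illustrative computation only, not the authors' evaluation code; the function names and the track layout (one 1-D array per parameter) are assumptions for the example.

```python
import numpy as np

def pearson_corr(estimated, ground_truth):
    """Pearson product-moment correlation between two 1-D parameter tracks."""
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(ground_truth, dtype=float)
    est_c = est - est.mean()  # center each track before correlating
    ref_c = ref - ref.mean()
    return float((est_c @ ref_c) / (np.linalg.norm(est_c) * np.linalg.norm(ref_c)))

def average_pearson(est_tracks, ref_tracks):
    """Mean correlation across all estimated parameters (the reported metric)."""
    return float(np.mean([pearson_corr(e, r) for e, r in zip(est_tracks, ref_tracks)]))
```

A "relative improvement" of 80.95% then means the proposed system's average correlation exceeds the baseline's by 80.95% of the baseline value.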