Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned with a mean squared error (MSE) loss between the SSL representations of enhanced and clean speech. However, the MSE objective is prone to exploiting positional embeddings in SSL models, allowing it to be minimised through positional correlations rather than content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but examined here in the fine-tuning setting, and (2) speed perturbation combined with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.
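To make the position-invariance point concrete, the following is a minimal, illustrative PyTorch sketch of a soft-DTW loss between SSL feature sequences of different lengths, as arises when the enhanced signal is speed-perturbed relative to the clean reference. This is not the paper's implementation: the squared-Euclidean frame cost, the smoothing parameter `gamma`, the feature dimensions, and the function name `soft_dtw_loss` are assumptions chosen for illustration.

```python
import torch

def soft_dtw_loss(x, y, gamma: float = 0.1):
    """Soft-DTW alignment cost between two feature sequences of possibly
    different lengths (soft-min recursion of Cuturi & Blondel, 2017);
    autograd provides gradients through the recursion.

    x: (n, d) SSL features of the enhanced, speed-perturbed utterance.
    y: (m, d) SSL features of the clean reference.
    """
    n, m = x.size(0), y.size(0)
    cost = torch.cdist(x, y) ** 2                      # (n, m) frame-pair costs

    inf = x.new_tensor(float("inf"))
    # R[i][j]: soft cost of aligning the first i frames of x with the first j frames of y.
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = x.new_tensor(0.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = torch.stack([R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]])
            softmin = -gamma * torch.logsumexp(-prev / gamma, dim=0)
            R[i][j] = cost[i - 1, j - 1] + softmin
    return R[n][m]


# Usage sketch (hypothetical shapes): the enhanced features come from a
# speed-perturbed signal, so an index-aligned MSE is not even defined here.
clean_feats = torch.randn(100, 768)                          # 100 frames of SSL features
enhanced_feats = torch.randn(110, 768, requires_grad=True)   # slowed-down signal -> more frames
loss = soft_dtw_loss(enhanced_feats, clean_feats)
loss.backward()                                              # gradients flow back to the SE front-end
```

Because the alignment is produced by the dynamic-programming recursion rather than by frame index, this objective cannot be minimised simply by matching positional embeddings, whereas frame-wise MSE requires equal-length, index-aligned sequences and can exploit that positional structure.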