Supervised speech enhancement methods have been very successful. In practical scenarios, however, clean reference speech is often unavailable, so self-supervised learning (SSL) based speech enhancement methods are desired that offer comparable enhancement performance and transfer to other speech-related downstream applications. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting the speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. During pre-training, the masked autoencoder learns to remove the added distortions while reconstructing the masked regions of the spectrogram. The pre-trained embeddings are then fed to fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features on denoising and dereverberation downstream tasks. We explore different augmentations (such as single- or multi-speaker mixing) in the pre-training augmentation stack, as well as the effect of different noisy input feature representations (such as $\mathrm{log1p}$ compression) on the pre-trained embeddings and on downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance on both in-domain and out-of-domain evaluation datasets.
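The ingredients named above (log1p spectrogram compression, patch masking, and a reconstruction loss computed on the masked regions) can be sketched minimally as follows. This is an illustrative NumPy sketch, not the paper's implementation: the patch size, mask ratio, and function names are assumptions, and a real system would operate on STFT magnitudes and use a learned encoder-decoder in place of the placeholder prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def log1p_compress(mag):
    # log1p compression: near-linear for small magnitudes,
    # logarithmic squashing for large ones (assumed input feature).
    return np.log1p(mag)

def random_patch_mask(shape, patch=(4, 4), mask_ratio=0.5, rng=rng):
    # Boolean mask over a (freq, time) spectrogram grid, True = masked.
    # Decisions are made per patch, then expanded to pixel resolution.
    gf, gt = shape[0] // patch[0], shape[1] // patch[1]
    grid = rng.random((gf, gt)) < mask_ratio
    return np.kron(grid, np.ones(patch, dtype=bool))[: shape[0], : shape[1]]

def masked_recon_loss(pred, target, mask):
    # MAE-style objective: MSE only on masked positions, so the model
    # must infer the hidden content rather than copy visible input.
    return float(((pred - target) ** 2 * mask).sum() / mask.sum())

# Toy example: "clean" target spectrogram, noisy/augmented input,
# and a trivial stand-in prediction (no model, illustration only).
clean = rng.random((64, 128))
noisy = clean + 0.1 * rng.standard_normal((64, 128))  # augmentation stack stand-in
x = log1p_compress(np.abs(noisy))
mask = random_patch_mask(x.shape)
loss = masked_recon_loss(x, log1p_compress(clean), mask)
```

In pre-training, `pred` would come from the autoencoder decoding the unmasked patches of the distorted input, and `target` would be the clean (or less distorted) spectrogram, so minimizing the loss jointly teaches in-painting and distortion removal.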