Self-supervised pre-trained speech models offer an effective way to leverage massive unannotated corpora to build strong automatic speech recognition (ASR) systems. However, many current models are trained on a clean corpus from a single source and tend to perform poorly when noise is present at test time. Overcoming the adverse influence of noise is nonetheless crucial for real-world applications. In this work, we propose a novel training framework, called deHuBERT, for noise reduction encoding, inspired by H. Barlow's redundancy-reduction principle. The framework improves the HuBERT training algorithm by introducing auxiliary losses that drive the self- and cross-correlation matrices between pairwise noise-distorted embeddings toward the identity matrix. This encourages the model to produce noise-agnostic speech representations. With this method, we report improved robustness in noisy environments, including under unseen noises, without impairing performance on the clean set.
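To make the idea of driving correlation matrices toward the identity concrete, the following is a minimal NumPy sketch of a redundancy-reduction style loss. It is an illustration under stated assumptions, not the deHuBERT implementation: the function names, the off-diagonal weight of 5e-3, the batch-wise standardization, and the unweighted sum of self- and cross-correlation terms are all hypothetical choices, since the abstract does not specify the exact formulation.

```python
import numpy as np


def standardize(z, eps=1e-8):
    """Zero-mean, unit-variance per feature dimension, computed over the batch axis."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)


def correlation_to_identity_loss(z_a, z_b, off_diag_weight=5e-3):
    """Penalize the distance between the correlation matrix of (z_a, z_b) and I.

    z_a, z_b: arrays of shape (n_samples, n_features), e.g. embeddings of the
    same utterances under two different noise distortions (assumed setup).
    off_diag_weight is an assumed hyperparameter, not a value from the paper.
    """
    n = z_a.shape[0]
    c = standardize(z_a).T @ standardize(z_b) / n  # (n_features, n_features)
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)            # pull diagonal entries to 1
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # pull off-diagonal entries to 0
    return on_diag + off_diag_weight * off_diag


def auxiliary_loss(z_a, z_b, off_diag_weight=5e-3):
    """Assumed combination: one cross-correlation term plus two self-correlation terms."""
    cross = correlation_to_identity_loss(z_a, z_b, off_diag_weight)
    self_a = correlation_to_identity_loss(z_a, z_a, off_diag_weight)
    self_b = correlation_to_identity_loss(z_b, z_b, off_diag_weight)
    return cross + self_a + self_b
```

In this sketch the cross-correlation term is small only when each feature dimension agrees across the two distorted views, which is what rewards noise-agnostic embeddings. The self-correlation terms have a diagonal of one by construction after standardization, so they act purely as a decorrelation penalty between feature dimensions. In a training setup such a term would be added to the main HuBERT objective with some weighting, which is likewise left unspecified here.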