Self-supervised pre-trained speech models have substantially improved speech recognition, yet they remain sensitive to domain shifts and to accented or atypical speech. Many of these models rely on quantisation or clustering to learn discrete acoustic units. We propose to correct the discrete units discovered for accented speech back to a standard pronunciation in an unsupervised manner. A masked language model is trained on discrete units from a standard accent and iteratively corrects an accented token sequence by masking unexpected cluster sequences and predicting their common variant. Small accent adapter blocks are inserted into the pre-trained model and fine-tuned to predict the corrected clusters, which increases the robustness of the pre-trained model to a target accent without any supervision. Altering the training regime with the proposed method improves a state-of-the-art HuBERT Large model on a downstream accented speech recognition task.
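The iterative correction step can be illustrated with a toy sketch. This is not the authors' implementation: a simple neighbour-count table stands in for the masked language model, and the function names `train_context_model` and `correct_units`, the `threshold` value, and the integer unit IDs are all assumptions made for illustration. The idea shown is only the loop itself: score each unit given its context, treat low-probability units as "masked", and replace them with the most common variant seen in standard-accent data.

```python
from collections import Counter, defaultdict


def train_context_model(sequences):
    """Count which unit occurs between each (left, right) neighbour pair.

    Toy stand-in for a masked language model trained on standard-accent units.
    """
    table = defaultdict(Counter)
    for seq in sequences:
        for left, mid, right in zip(seq, seq[1:], seq[2:]):
            table[(left, right)][mid] += 1
    return table


def correct_units(seq, table, threshold=0.2, max_iters=5):
    """Iteratively replace unexpected units with their most common variant.

    threshold and max_iters are hypothetical settings, not values from the paper.
    """
    seq = list(seq)
    for _ in range(max_iters):
        changed = False
        # Only interior positions have both a left and a right neighbour.
        for i in range(1, len(seq) - 1):
            counts = table.get((seq[i - 1], seq[i + 1]))
            if not counts:
                continue  # context never seen in standard data: leave as-is
            prob = counts[seq[i]] / sum(counts.values())
            if prob < threshold:
                # "Mask" the unexpected unit and predict the common variant.
                seq[i] = counts.most_common(1)[0][0]
                changed = True
        if not changed:
            break  # converged: no unit was unexpected in this sweep
    return seq


# Usage with made-up unit IDs: unit 7 plays the role of an accented variant of unit 2.
standard = [[1, 2, 3, 4]] * 10
model = train_context_model(standard)
corrected = correct_units([1, 7, 3, 4], model)
```

In the method described above, the corrected sequence would then serve as the prediction target when fine-tuning the adapter blocks; that step is not shown here.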