Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully to several downstream tasks. However, such representations may be skewed toward the canonical data characteristics of those corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose and investigate self-supervised adaptation of speech representations to such populations in a parameter-efficient way by training accent-specific residual adapters. We experiment with four accents and choose automatic speech recognition (ASR) as the downstream task of interest. We obtain strong word error rate reductions (WERR) over HuBERT-large for all four accents, with a mean WERR of 22.7% with accent-specific adapters and a mean WERR of 25.1% if the entire encoder is accent-adapted. While our experiments use HuBERT as the model and ASR as the downstream task, our proposed approach is both model-agnostic and task-agnostic.
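The abstract does not define how WERR is computed. A minimal sketch, assuming the common convention that WERR is the reduction in word error rate relative to the baseline, expressed as a percentage, and that the mean is taken over per-accent values; the function name `werr`, the accent labels, and the WER numbers are hypothetical and are not results from the paper:

```python
def werr(baseline_wer: float, adapted_wer: float) -> float:
    """Relative word error rate reduction, in percent (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical (baseline WER, adapted WER) pairs in percent, for illustration only.
hypothetical_wers = {
    "accent_a": (20.0, 15.0),
    "accent_b": (10.0, 8.0),
}

# Per-accent WERR, then the unweighted mean across accents.
per_accent = {name: werr(base, adapted) for name, (base, adapted) in hypothetical_wers.items()}
mean_werr = sum(per_accent.values()) / len(per_accent)
```

Under this assumed definition, a positive WERR means the adapted model makes fewer word errors than the baseline, and a value of 0 means no change.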