Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data-intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and by performance sensitivity to transcription errors. To address these issues, a set of compact and data-efficient speaker-dependent (SD) parameter representations is used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced by using confidence scores to select the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperform both the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation without speaker data selection. These improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets.
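The confidence score-based selection step described above can be illustrated with a minimal sketch. It assumes that each first-pass hypothesis already carries an utterance-level confidence score in [0, 1] and that selection is a simple threshold; the `Utterance` class, the `select_adaptation_data` function, the example scores, and the threshold of 0.8 are all hypothetical and are not taken from the paper, whose confidence estimation modules and selection criterion may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    hypothesis: str    # first-pass SI decoding output, used as adaptation supervision
    confidence: float  # assumed utterance-level confidence score in [0, 1]


def select_adaptation_data(utts: List[Utterance], threshold: float = 0.8) -> List[Utterance]:
    """Keep only utterances whose confidence meets the (hypothetical) threshold."""
    return [u for u in utts if u.confidence >= threshold]


# Toy speaker-level data with made-up hypotheses and scores.
speaker_utts = [
    Utterance("utt1", "okay so how are you", 0.93),
    Utterance("utt2", "i think the whether is", 0.41),
    Utterance("utt3", "yeah that sounds good", 0.85),
]

selected = select_adaptation_data(speaker_utts, threshold=0.8)
selected_ids = [u.utt_id for u in selected]
print(selected_ids)
```

Raising the threshold yields cleaner supervision but leaves fewer utterances per speaker, which is the data sparsity trade-off that the Bayesian treatment of the SD parameters is intended to mitigate.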