Accurate recognition of dysarthric and elderly speech remains a challenging task. Speaker-level heterogeneity attributed to accent or gender, when compounded by age and speech impairment, creates large diversity among these speakers. The scarcity of speaker-level data limits the practical use of data-intensive, model-based speaker adaptation methods. To this end, this paper proposes two novel forms of data-efficient, feature-based, on-the-fly speaker adaptation: variance-regularized spectral basis embedding (SVR) and spectral-feature-driven f-LHUC transforms. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that the proposed on-the-fly speaker adaptation approaches consistently outperform baseline iVector-adapted hybrid DNN/TDNN and E2E Conformer systems by statistically significant WER reductions of 2.48%-2.85% absolute (7.92%-8.06% relative), and outperform offline model-based LHUC adaptation by 1.82% absolute (5.63% relative).