Automatic recognition of disordered and elderly speech remains a highly challenging task because such data is difficult to collect in large quantities. This paper explores a series of approaches for integrating domain-adapted, self-supervised learning (SSL) pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain-adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding in which the TDNN/Conformer system outputs are rescored using domain-adapted wav2vec2.0 models. In addition, domain-adapted wav2vec2.0 representations are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that TDNN and Conformer ASR systems integrating domain-adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models, with statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility speakers, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers and the DementiaBank Pitt test set, respectively.
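Approaches a) and b) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `fuse_features` and `joint_frame_scores`, the interpolation `weight`, and the choice of weighted log-linear score combination are all assumptions made for illustration, and the two feature streams are assumed to have been aligned to the same frame rate beforehand.

```python
import math


def fuse_features(acoustic, ssl):
    """Input feature fusion (approach a): concatenate, frame by frame,
    a standard acoustic feature vector with an SSL representation.
    Assumes both streams are already aligned to the same frame rate."""
    if len(acoustic) != len(ssl):
        raise ValueError("feature streams must have the same number of frames")
    return [a + s for a, s in zip(acoustic, ssl)]


def log_softmax(scores):
    """Numerically stable log-softmax over one frame's scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(x - m) for x in scores))
    return [x - lse for x in scores]


def joint_frame_scores(logp_a, logp_b, weight=0.5):
    """Frame-level joint decoding (approach b), sketched as a weighted
    log-linear interpolation of two systems' per-frame log-posteriors.
    The interpolation form and the weight are illustrative assumptions."""
    combined = []
    for frame_a, frame_b in zip(logp_a, logp_b):
        mixed = [(1.0 - weight) * a + weight * b
                 for a, b in zip(frame_a, frame_b)]
        combined.append(log_softmax(mixed))  # renormalise each frame
    return combined


# Toy example: one frame, two output classes.
system_a = [log_softmax([2.0, 0.0])]   # hypothetical acoustic-only system
system_b = [log_softmax([0.0, 1.0])]   # hypothetical system with SSL features
joint = joint_frame_scores(system_a, system_b, weight=0.5)
```

In practice the interpolation weight would be tuned on held-out data, and the combined frame scores would be passed to the decoder in place of either system's own scores.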
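The reported absolute and relative WER reductions can be cross-checked with a little arithmetic. The sketch below assumes that each reduction is measured from the standalone wav2vec2.0 baseline down to the lowest WER quoted for that task. The abstract does not state the baseline WERs, so the `implied` values are derived here rather than reported.

```python
# Figures taken from the abstract (all in % WER).
reported = {
    "UASpeech": {"abs": 8.22, "rel": 26.71, "final": 22.56},
    "DementiaBank Pitt": {"abs": 3.43, "rel": 15.88, "final": 18.17},
}

implied = {}    # derived baseline WER per task (an inference, not a reported number)
recomputed = {} # relative reduction recomputed from the derived baseline

for task, r in reported.items():
    baseline = r["final"] + r["abs"]           # assumed standalone wav2vec2.0 WER
    implied[task] = baseline
    recomputed[task] = 100.0 * r["abs"] / baseline
    print(f"{task}: implied baseline {baseline:.2f}%, "
          f"relative reduction {recomputed[task]:.2f}% "
          f"(reported {r['rel']:.2f}%)")
```

Under this assumption the implied baselines are about 30.78% and 21.60% WER, and the recomputed relative reductions round to the reported 26.71% and 15.88%, so the absolute, relative, and final figures are mutually consistent.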