The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and non-aged voices, data scarcity, and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods for learning homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features, which exploit a special regularization term to enforce the homogeneity of speaker features during adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions of up to 5.32% absolute (18.57% relative), and batch-mode LHUC speaker adaptation by up to 2.24% absolute (9.20% relative), while operating up to 33.6 times faster than xVectors in real-time factor during adaptation. The efficacy of the proposed adaptation techniques is further demonstrated in a comparison against current ASR technologies, including SSL pre-trained systems, on UASpeech, where our best system achieves a state-of-the-art WER of 23.33%. Analyses show that VR-SBE features and f-LHUC transforms are insensitive to the amount of speaker-level data available during test-time adaptation. t-SNE visualization reveals that they exhibit stronger speaker-level homogeneity than baseline iVectors, xVectors, and batch-mode LHUC transforms.