Automatic recognition of dysarthric speech remains a highly challenging task. Neuro-motor conditions and co-occurring physical disabilities make large-scale data collection for ASR system development difficult. Adapting self-supervised learning (SSL) pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of data augmentation approaches for improving the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation, of normal control speech based on its time alignment against parallel dysarthric speech; and c) novel spectral basis GAN-based adversarial data augmentation operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest that GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using no data augmentation and speed perturbation across different data expansion operating points, with statistically significant word error rate (WER) reductions of up to 2.01% and 0.96% absolute (9.03% and 4.63% relative), respectively, on the UASpeech test set of 16 dysarthric speakers. After cross-system output rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility speech) on UASpeech.
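Since the results above are reported as absolute and relative WER reductions, a minimal sketch of how those quantities are conventionally computed may be helpful. It assumes the standard word-level edit-distance definition of WER. The function names `word_error_rate` and `wer_reduction` are hypothetical, and the numbers in the usage lines are illustrative only; none of this is taken from the paper's scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def wer_reduction(baseline_wer: float, improved_wer: float) -> tuple[float, float]:
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - improved_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Illustrative values only (not results from the paper).
wer = word_error_rate("the cat sat on mat", "the cat sit on")
absolute, relative = wer_reduction(20.0, 18.0)
```

The distinction matters when reading the abstract: an absolute reduction is a difference in percentage points, while a relative reduction expresses that difference as a fraction of the baseline WER.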