Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. However, progress in dysarthric ASR (DASR) has been limited by the high variability of dysarthric speech and the limited public availability of dysarthric training data. This paper demonstrates that data augmentation using text-to-dysarthric-speech (TTDS) synthesis is effective for fine-tuning large ASR models for DASR. Specifically, diffusion-based text-to-speech (TTS) models can produce speech samples similar to dysarthric speech that serve as additional training data for fine-tuning ASR foundation models, in this case Whisper. Results show that the proposed multi-speaker diffusion-based TTDS data augmentation improves both synthesis metrics and ASR performance compared to current DASR baselines.