Dysarthric speech recognition (DSR) research has progressed in recent years from recognizing isolated words to recognizing sentence-level speech, driven by the communication needs of individuals with dysarthria. Data scarcity, however, remains a major obstacle to building effective sentence-level DSR systems. Dysarthric data augmentation (DDA) is a promising response: generative models are frequently used to synthesize training data for automatic speech recognition, but their usefulness depends on how accurately the synthesized data represents the target domain. Because pronunciation varies widely among dysarthric speakers, models trained on data from existing speakers struggle to produce useful augmented data, especially in zero-shot or one-shot settings. To address this limitation, we propose a novel text-coverage strategy designed for text-matching data synthesis. The strategy enables efficient zero/one-shot DDA and substantially improves DSR performance on unseen dysarthric speakers. These improvements matter for practical applications such as dysarthria rehabilitation programs and everyday communication with common sentences.