Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and atypical speech patterns. In healthcare settings, the resulting communication breakdowns reduce the quality of care. While building an augmentative and alternative communication (AAC) tool to enable fluid communication, we found that state-of-the-art (SOTA) automatic speech recognition (ASR) models such as Whisper and Wav2vec2.0 marginalize atypical speakers, largely because of a lack of training data. Our work leverages SOTA ASR followed by domain-specific error correction. English dysarthric ASR performance is often evaluated on the TORGO dataset, which has a well-known prompt-overlap issue: the same phrases appear for both training and test speakers. We propose an algorithm to break this prompt overlap. After prompt overlap is reduced, SOTA ASR models produce extremely high word error rates for speakers with mild and severe dysarthria. To improve ASR, we further examine the impact of n-gram language models and large language model (LLM) based multi-modal generative error-correction algorithms, such as Whispering-LLaMA, as a second ASR pass. Our work highlights how much more needs to be done to improve ASR for atypical speakers and to enable equitable healthcare access, both in person and in e-health settings.
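The word error rate (WER) mentioned above is the standard metric: the number of word-level substitutions, deletions, and insertions needed to turn the ASR hypothesis into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that textbook definition, not the evaluation code used in this work. The function name `word_error_rate` and the lowercase-and-split-on-whitespace normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization here (lowercasing, whitespace tokenization) is an
    illustrative assumption; real evaluations often also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER is not bounded by 1: a hypothesis much longer than the reference can score above 100%.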