Deaf or hard-of-hearing (DHH) speakers typically have atypical speech caused by deafness. With the growing support for speech-based devices and software applications, more work is needed to make these devices inclusive of everyone. To that end, we analyze the use of openly available automatic speech recognition (ASR) tools with a DHH Japanese speaker dataset. Because these out-of-the-box ASR models typically do not perform well on DHH speech, we provide a thorough analysis of how to create personalized ASR systems. We collected a large DHH dataset of four speakers totaling approximately 28.05 hours and analyzed the performance of different training frameworks while varying the training data size. Our findings show that 1000 utterances (or 1-2 hours) from a target speaker can already significantly improve model performance with minimal effort; we therefore recommend that researchers collect at least 1000 utterances to build an efficient personalized ASR system. In cases where 1000 utterances are difficult to collect, we also find significant improvements from previously proposed data augmentation techniques, such as intermediate fine-tuning, when only 200 utterances are available.