As automatic speech recognition (ASR) systems become more widespread, the need to improve their accessibility grows. Handling stuttering speech is an important capability for an accessible ASR system. To improve the accessibility of ASR systems for stutterers, we need to expose and analyze the failures of these systems on stuttering speech. Existing speech datasets recorded from stutterers are not diverse enough to expose most of these failures. Furthermore, these datasets lack ground-truth transcripts of the intended, non-stuttered text, which makes them unsuitable as comprehensive test suites. A methodology for generating stuttering speech as test inputs is therefore needed to test and analyze the performance of ASR systems. Generating valid test inputs in this setting is challenging, however, because the generated inputs must mimic how stutterers speak while remaining diverse enough to trigger more failures. To address this challenge, we propose ASTER, a technique for automatically testing the accessibility of ASR systems. ASTER generates valid test cases by injecting five different types of stuttering. The generated test cases both simulate realistic stuttering speech and expose failures in ASR systems. ASTER further improves the quality of the test cases with a multi-objective optimization-based seed updating algorithm. We implemented ASTER as a framework and evaluated it on four open-source ASR models and three commercial ASR systems. Our comprehensive evaluation shows that ASTER significantly increases the word error rate, match error rate, and word information loss of the evaluated ASR systems. Additionally, our user study demonstrates that the generated stuttering audio is indistinguishable from real-world stuttering audio clips.
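The three metrics named in the abstract (word error rate, match error rate, and word information loss) can all be derived from a word-level alignment between the ground-truth text and an ASR transcript. The following is a minimal sketch using the standard definitions of these metrics, not ASTER's own evaluation code. The function names `align_counts` and `error_metrics` and the example sentences are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    hits: int  # words recognized correctly
    subs: int  # substituted words
    dels: int  # reference words missing from the hypothesis
    ins: int   # extra words in the hypothesis


def align_counts(reference: str, hypothesis: str) -> Counts:
    """Count hits/substitutions/deletions/insertions via Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dist[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back through the table to recover one optimal alignment.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost == 0:
                    hits += 1
                else:
                    subs += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return Counts(hits, subs, dels, ins)


def error_metrics(reference: str, hypothesis: str) -> dict:
    """Return WER, MER, and WIL for a single reference/hypothesis pair."""
    c = align_counts(reference, hypothesis)
    errors = c.subs + c.dels + c.ins
    n_ref = c.hits + c.subs + c.dels  # number of reference words
    n_hyp = c.hits + c.subs + c.ins   # number of hypothesis words
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    wer = errors / n_ref
    mer = errors / (c.hits + errors)
    # With an empty hypothesis no information is preserved, so WIL is 1.
    wil = 1.0 if n_hyp == 0 else 1.0 - (c.hits ** 2) / (n_ref * n_hyp)
    return {"wer": wer, "mer": mer, "wil": wil}


# Hypothetical example: the transcript keeps two stuttered repetitions.
scores = error_metrics("please call me tomorrow",
                       "p please call call me tomorrow")
print({name: round(value, 3) for name, value in scores.items()})
# {'wer': 0.5, 'mer': 0.333, 'wil': 0.333}
```

In this example the two extra words count as insertions, so the word error rate is 2/4 while the match error rate is 2/6. Higher values on all three metrics indicate worse transcription, which is the direction in which the abstract reports ASTER's test cases moving the evaluated systems.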