With the wide adoption of automated speech recognition (ASR) systems, testing and improving them is increasingly important. However, collecting and executing speech test cases is usually expensive and time-consuming, which motivates prioritizing them strategically. The key question is: in what order should speech test cases be collected and executed so as to uncover as many errors as early as possible? Each speech test case consists of a piece of audio and the corresponding reference text. In this work, we propose PROPHET (PRiOritizing sPeecH tEsT), a tool that predicts which speech test cases are likely to uncover errors based only on their reference texts. PROPHET therefore analyzes and prioritizes test cases without running the ASR system, which lets it scale to large collections of speech test cases. We evaluate 6 prioritization methods on 3 ASR systems and 12 datasets. Given the same testing budget, our approach uncovers 12.63% more wrongly recognized words than the state-of-the-art method. We also select test cases from the prioritized list to fine-tune the ASR systems and analyze how much our approach improves their performance. Statistical tests show that our method brings significantly larger performance improvements to ASR systems than the existing baseline methods. Furthermore, a correlation analysis confirms that fine-tuning an ASR system on a dataset on which the model performs worse tends to improve its performance more.
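The prioritization setting described above can be illustrated with a minimal Python sketch. It is not PROPHET's implementation: the abstract does not specify the predictor, so `score` is a hypothetical placeholder for any function that maps a reference text to a predicted error likelihood, and the helper names `prioritize`, `word_errors`, and `errors_within_budget` are assumptions made for illustration. The sketch ranks test cases using only their reference texts, then counts how many wrongly recognized words the top-ranked cases uncover under a fixed budget, using word-level edit distance as one common way to count word errors.

```python
from typing import Callable, List, Sequence


def prioritize(reference_texts: Sequence[str],
               score: Callable[[str], float]) -> List[int]:
    """Return test-case indices ordered from highest to lowest predicted score.

    Only the reference texts are used; no audio is loaded and no ASR system
    is run. `score` is a hypothetical stand-in for a learned predictor.
    """
    scores = [score(text) for text in reference_texts]
    return sorted(range(len(reference_texts)),
                  key=lambda i: scores[i], reverse=True)


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def errors_within_budget(order: Sequence[int],
                         errors_per_case: Sequence[int],
                         budget: int) -> int:
    """Total word errors uncovered by executing the first `budget` cases."""
    return sum(errors_per_case[i] for i in order[:budget])
```

Under these assumptions, two prioritization methods can be compared by computing `errors_within_budget` for each method's ordering at the same `budget`; a better ordering uncovers more word errors earlier.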