As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 114 input perturbations which simulate a heterogeneous range of corruptions that ASR models may encounter when deployed in the wild. We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices, such as the use of discrete representations or self-training, appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and male and female speakers. Our results reveal noticeable disparities in the models' robustness across subgroups. We believe that SRB will significantly facilitate future research towards robust ASR models by making it easier to conduct comprehensive and comparable robustness evaluations.