Automatic speech recognition (ASR) systems have been shown to exhibit large quality disparities across the language varieties they are intended or expected to recognize. One way to mitigate this is to train or fine-tune models on more representative datasets, but this approach can be hindered by limited in-domain data for training and evaluation. We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain (long-form) African American English (AAE) data. We use CORAAL, YouTube, and Mozilla Common Voice data to train an audio classifier that approximately predicts whether an utterance is AAE or some other variety, including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data reduces the word error rate disparity between AAE and MAE by 38.5% relative, without reducing MAE quality.
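The selection step and the reported metric can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Utterance` fields, the 0.5 score threshold, the rule that an utterance must pass both the classifier and the geographic filter, and the definition of disparity as the absolute WER gap between AAE and MAE. The numbers in the usage note are made up and are not the paper's results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    aae_score: float        # hypothetical classifier score in [0, 1]
    in_target_region: bool  # hypothetical coarse geographic flag


def select_for_finetuning(utterances: List[Utterance],
                          score_threshold: float = 0.5) -> List[Utterance]:
    """Keep utterances that pass both the classifier and the geographic filter.

    Requiring both signals is an assumption; the source only says the two
    are combined.
    """
    return [
        u for u in utterances
        if u.aae_score >= score_threshold and u.in_target_region
    ]


def relative_disparity_reduction(wer_aae_before: float, wer_mae_before: float,
                                 wer_aae_after: float, wer_mae_after: float) -> float:
    """Relative reduction of the AAE-MAE WER gap, as a fraction.

    Assumes disparity is defined as WER(AAE) - WER(MAE).
    """
    gap_before = wer_aae_before - wer_mae_before
    gap_after = wer_aae_after - wer_mae_after
    if gap_before == 0:
        raise ValueError("No initial disparity to reduce.")
    return (gap_before - gap_after) / gap_before
```

With illustrative inputs, a gap that shrinks from 10 to 5 WER points while MAE WER stays fixed gives `relative_disparity_reduction(20.0, 10.0, 15.0, 10.0) == 0.5`, that is, a 50% relative reduction.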