Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

Automatic Speech Recognition (ASR) systems perform best on speech similar to that on which they were trained. As a result, underrepresented varieties, including regional dialects, minority speakers, and low-resource languages, see much higher word error rates (WERs) than varieties regarded as 'prestigious', 'mainstream', or 'standard'. This can be a barrier to incorporating ASR technology into the annotation process for large-scale linguistic research, since manually correcting erroneous automated transcripts can be just as time- and resource-consuming as transcribing manually. A deeper understanding of the behaviour of an ASR system is thus beneficial both from a speech technology standpoint, in terms of improving ASR accuracy, and from an annotation standpoint, where knowing the errors an ASR system is likely to make can aid manual correction. This work demonstrates a method of probing an ASR system to discover how it handles phonetic variation across a number of L2 Englishes, specifically how particular phonetic realisations that were rare or absent in the system's training data can lead to phoneme-level misrecognitions and contribute to higher WERs. It is demonstrated that the behaviour of the ASR system is systematic and consistent across speakers with similar spoken varieties (in this case, the same L1) and that its phoneme substitution errors are typically in agreement with human annotators. By identifying problematic productions, specific weaknesses can be addressed by sourcing such realisations for training and fine-tuning, thus making the system more robust to pronunciation variation.
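As a rough illustration of the two quantities discussed above (word error rate and phoneme-level substitution errors), the following minimal Python sketch aligns a reference sequence with an ASR hypothesis using a standard Levenshtein alignment and tallies the substitutions. It is not the probing method of this work, only a sketch under stated assumptions: the function names (`align`, `error_rate`, `substitution_counts`) are hypothetical, and the ARPAbet phoneme sequences in the usage lines are invented for illustration rather than taken from any data.

```python
from collections import Counter


def align(ref, hyp):
    """Levenshtein-align two token lists.

    Returns a list of (op, ref_token, hyp_token) tuples, where op is one of
    "ok", "sub", "del" (token missing from hyp) or "ins" (extra token in hyp).
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / reference length.

    With word tokens this is the WER; with phoneme tokens, a phoneme error rate.
    """
    if not ref:
        raise ValueError("reference must not be empty")
    errors = sum(op != "ok" for op, _, _ in align(ref, hyp))
    return errors / len(ref)


def substitution_counts(pairs):
    """Count (reference phoneme, recognised phoneme) substitutions over many pairs."""
    counts = Counter()
    for ref, hyp in pairs:
        for op, r, h in align(ref, hyp):
            if op == "sub":
                counts[(r, h)] += 1
    return counts


# Illustrative usage with invented examples (not data from this work).
wer = error_rate("the cat sat".split(), "the cat sad".split())
subs = substitution_counts([
    (["TH", "IH", "NG", "K"], ["S", "IH", "NG", "K"]),  # "think" heard as "sink"
    (["TH", "R", "IY"], ["T", "R", "IY"]),              # "three" heard as "tree"
])
```

Grouping such counts by speaker L1 would be one way to check whether substitutions pattern consistently across speakers of the same variety; that grouping is left out here because it depends on how the corpus metadata is organised.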
