We investigate which linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. We hypothesize that orthographic and phonological complexity both degrade accuracy. To test this, we fine-tune the multilingual self-supervised pretrained model Wav2Vec2-XLSR-53 on 25 languages with 15 writing systems, and compare ASR accuracy across languages against four measures: number of graphemes, unigram grapheme entropy, logographicity (how much word/morpheme-level information is encoded in the writing system), and number of phonemes. The results show that the orthographic complexity measures correlate significantly with low ASR accuracy, while phonological complexity shows no significant correlation.
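To make two of the orthographic measures concrete, the following is a minimal sketch of how the number of graphemes and the unigram grapheme entropy of a transcript might be computed. It assumes that every non-whitespace Unicode code point counts as one grapheme and that entropy is Shannon entropy in bits; the function name `grapheme_stats` is hypothetical, and the segmentation and normalization actually used in the study may differ.

```python
import math
from collections import Counter


def grapheme_stats(text):
    """Return (number of distinct graphemes, unigram grapheme entropy in bits).

    Assumption for illustration: each non-whitespace Unicode code point is
    treated as one grapheme. Scripts with combining marks may need a proper
    grapheme-cluster segmenter instead.
    """
    # Drop whitespace so that spaces and newlines are not counted as graphemes.
    symbols = [ch for ch in text if not ch.isspace()]
    counts = Counter(symbols)
    total = sum(counts.values())

    # Shannon entropy: H = -sum_g p(g) * log2 p(g), with p(g) the relative
    # frequency of grapheme g. An empty text yields an entropy of 0.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return len(counts), entropy


if __name__ == "__main__":
    n_graphemes, entropy = grapheme_stats("abab")
    print(n_graphemes, entropy)
```

Under these assumptions, a writing system with a large and evenly used grapheme inventory gets a high entropy, while a small inventory dominated by a few symbols gets a low one.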