Word Error Rate (WER) mischaracterizes the performance of ASR models on African languages because it collapses phonological, tonal, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in model output. In this paper, we evaluate three speech encoders on two African languages, complementing WER with Character Error Rate (CER) and FER and adding a tone-aware extension (TER). We show that by computing errors on phonological features, FER and TER reveal linguistically salient error patterns even when word-level accuracy remains low. Our results show that models perform better on segmental features, while tones (especially mid and downstep) remain the most challenging features. Results on Yoruba show a striking gap between metrics, with WER=0.788, CER=0.305, and FER=0.151. Similarly, for Uneme (an endangered language absent from pretraining data), a model with near-total WER and a CER of 0.461 achieves a relatively low FER of 0.267. This indicates that model errors are often attributable to errors in individual phonetic features, a pattern that all-or-nothing metrics like WER obscure.
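To illustrate why the three metrics can diverge so sharply, the following is a minimal sketch rather than the paper's implementation. It assumes FER is computed as an edit distance in which a substitution costs the fraction of phonological features that differ between two segments. The feature table `TOY_FEATURES`, the four features it encodes, and the function names are all hypothetical and chosen only for illustration; the tone-aware TER is not modeled here.

```python
from typing import Callable, Optional, Sequence

# Hypothetical toy feature table: (voiced, nasal, labial, vowel).
# A real system would use a full phonological feature inventory.
TOY_FEATURES = {
    "b": (1, 0, 1, 0),
    "p": (0, 0, 1, 0),
    "m": (1, 1, 1, 0),
    "a": (1, 0, 0, 1),
}


def binary_cost(a: str, b: str) -> float:
    """All-or-nothing substitution cost, as used by WER and CER."""
    return 0.0 if a == b else 1.0


def feature_cost(a: str, b: str) -> float:
    """Assumed FER substitution cost: fraction of differing features."""
    if a == b:
        return 0.0
    if a not in TOY_FEATURES or b not in TOY_FEATURES:
        return 1.0  # unknown segment: fall back to a full error
    fa, fb = TOY_FEATURES[a], TOY_FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def edit_distance(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub_cost: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Levenshtein distance with unit insert/delete cost and a
    pluggable substitution cost."""
    cost = sub_cost or binary_cost
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, 1):
        curr = [float(i)]
        for j, h in enumerate(hyp, 1):
            curr.append(
                min(
                    prev[j] + 1.0,             # deletion
                    curr[j - 1] + 1.0,         # insertion
                    prev[j - 1] + cost(r, h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, sub_cost=None) -> float:
    """Edit distance normalized by reference length."""
    return edit_distance(ref, hyp, sub_cost) / max(len(ref), 1)


# One reference word "ba" recognized as "pa" (a single voicing error).
wer = error_rate(["ba"], ["pa"])                      # word level
cer = error_rate(list("ba"), list("pa"))              # character level
fer = error_rate(list("ba"), list("pa"), feature_cost)  # feature level
print(f"WER={wer:.3f} CER={cer:.3f} FER={fer:.3f}")
```

In this toy case a single voicing error makes the whole word wrong (WER 1.0) and half the characters wrong (CER 0.5), yet only one of eight reference feature values is wrong (FER 0.125). This mirrors the ordering of the Yoruba and Uneme figures reported above, although the actual feature set and normalization used in the paper may differ.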