Multilingual ASR models such as Whisper perform well on high-resource languages but yield substantially higher Word Error Rates (WER) on Dravidian languages than on Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, which leads to sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals an imbalance in the decoder between self-attention (linguistic context) and cross-attention (acoustic cues). Synthetic token-repetition experiments indicate potential gains but are impractical to apply. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances the two attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.
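The corpus properties named in the abstract (word length, vocabulary diversity, repetition) and the WER metric can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names `corpus_stats` and `wer`, the whitespace tokenization, and the choice of type-token ratio as the diversity measure are all assumptions made for illustration.

```python
from collections import Counter


def corpus_stats(transcripts):
    """Mean word length, type-token ratio, and mean occurrences per word type.

    Assumption: words are whitespace-separated, and length is counted in
    Unicode code points (not grapheme clusters), which overcounts for
    scripts that use combining vowel signs.
    """
    words = [w for line in transcripts for w in line.split()]
    if not words:
        raise ValueError("no words in transcripts")
    counts = Counter(words)
    return {
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(counts) / len(words),  # higher = more diverse
        "mean_repetition": len(words) / len(counts),   # lower = sparser
    }


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Because WER is scored per word, a single character-level substitution inside a long word counts as a full word error, so a language with longer and less frequently repeated words is penalized more heavily for the same number of character mistakes.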