We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance generally improves as the phylogenetic distance between languages decreases, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of data from phylogenetically related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models on it. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into the challenges ASR systems face with dialectal and non-standardized speech.