Researchers and consumers have long noticed that NLP tools perform worse on minority variants of languages (e.g., Puerto Rican Spanish or Swiss German), but studies exploring this have been limited to a select few languages. Moreover, past studies have mainly been conducted in a monolingual context, so cross-linguistic trends have not been identified or tied to external factors. In this work, we conduct a comprehensive evaluation of the most influential, state-of-the-art large language models (LLMs) across two high-use applications, machine translation and automatic speech recognition, to assess their performance on the regional dialects of several high- and low-resource languages. We also analyze how the regional dialect gap correlates with economic, social, and linguistic factors. The impact of training data, including related factors such as dataset size and construction procedure, is shown to be significant but not consistent across models or languages, meaning that no one-size-fits-all approach can close the dialect gap. This work lays the foundation for advancing dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.