State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches. We describe a wide range of NLP tasks in terms of two categories: natural language understanding (NLU) (for tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (for summarisation, machine translation, and dialogue systems). The survey is also broad in its coverage of languages, which include English, Arabic, and German, among others. We observe that past work in NLP concerning dialects goes deeper than mere dialect classification. It ranges from early approaches that used sentence transduction to recent approaches that integrate hypernetworks into LoRA. We expect that this survey will be useful to NLP researchers interested in building equitable language technologies by rethinking large language model (LLM) benchmarks and model architectures.