State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches. We describe a wide range of NLP tasks in two categories: natural language understanding (NLU) (for tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (for summarisation, machine translation, and dialogue systems). The survey is also broad in its coverage of languages, which includes English, Arabic, and German, among others. We observe that past work in NLP concerning dialects goes deeper than mere dialect classification, spanning early approaches based on sentence transduction through to recent approaches that integrate hypernetworks into LoRA. We expect that this survey will be useful to NLP researchers interested in building equitable language technologies by rethinking LLM benchmarks and model architectures.