Language technologies should be judged on their usefulness in real-world use cases. An often-overlooked aspect of natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first large-scale NLP benchmark for varieties, which aggregates an extensive set of variety datasets spanning diverse tasks (10 text-level tasks covering 281 varieties). This enables a comprehensive evaluation of NLP system performance across language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH offers a comprehensive view of the current state of NLP for language varieties and a step toward advancing it further. Code/data: https://github.com/ffaisal93/DialectBench