Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect of natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first large-scale benchmark for NLP on varieties, which aggregates an extensive set of variety datasets spanning multiple tasks (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for language varieties and is one step towards advancing it further. Code/data: https://github.com/ffaisal93/DialectBench