Current AI-based language technology (language models, machine translation systems, multilingual dictionaries and corpora) is widely acknowledged to focus on the 2-3% most widely spoken of the world's languages. Recent research efforts have attempted to expand the coverage of AI technology to "under-resourced languages." The goal of this paper is to draw attention to a phenomenon that we call linguistic bias: multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden, representational preference towards certain languages. Linguistic bias manifests as uneven per-language performance even under similar test conditions. We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented, and that can even become ethically problematic because they disregard valuable aspects of diversity as well as the needs of the language communities themselves. As our attempt at building diversity-aware language resources, we present a new initiative that aims to reduce linguistic bias through both technological design and methodology, based on collaboration on an equal footing with local communities.