Multilingual language models (LMs) promise broader access to NLP, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or artifacts of modeling choices. We organize the literature around two questions: whether performance disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent linguistic complexity, and which design choices mitigate inequities across typologically diverse languages. We review linguistic features such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting that much of the apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.