Multilingual language models (LMs) promise broader access to NLP, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: whether linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity, and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, including orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting that much of the apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.
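To make the "data exposure" and "sampling" recommendations concrete, the sketch below shows temperature-based corpus sampling, a widely used way to rebalance exposure toward low-resource languages (exponentiating raw corpus proportions by a temperature alpha < 1 and renormalizing). The corpus sizes and the alpha value are illustrative assumptions, not figures from this survey.

```python
# A minimal sketch of temperature-based data sampling for multilingual
# pretraining. Token counts below are hypothetical, chosen only to show
# how alpha < 1 upweights low-resource languages.
corpus_tokens = {"en": 3.0e9, "hi": 2.0e8, "sw": 1.0e7}  # assumed sizes

def sampling_probs(sizes: dict[str, float], alpha: float = 0.3) -> dict[str, float]:
    """Exponentiate each language's corpus proportion by alpha and
    renormalize so the probabilities sum to 1."""
    total = sum(sizes.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    z = sum(smoothed.values())
    return {lang: p / z for lang, p in smoothed.items()}

raw = {lang: n / sum(corpus_tokens.values()) for lang, n in corpus_tokens.items()}
print("raw proportions:      ", {k: round(v, 4) for k, v in raw.items()})
print("alpha=0.3 proportions:", {k: round(v, 4) for k, v in sampling_probs(corpus_tokens).items()})
```

Under these assumed sizes, Swahili's sampling probability rises from under 1% of batches to roughly 14%, illustrating how exposure normalization can narrow the gaps the abstract describes.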