Large language models (LLMs) are trained on massive multilingual corpora, yet this abundance masks a profound crisis: of the world's 7,613 living languages, approximately 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. We propose a critical framework that connects empirical measurements of language vitality (real-world demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice to explain why linguistic inequality in AI systems is not incidental but structural. Analyzing data across all documented human languages, we identify four categories: Strongholds (33%, high vitality and digitality), Digital Echoes (6%, high digitality despite declining vitality), Fading Voices (36%, low on both dimensions), and, critically, Invisible Giants (27%, high vitality but near-zero digitality), languages spoken by millions yet absent from the LLM universe. We demonstrate that these patterns reflect continuities from colonial-era linguistic hierarchies to contemporary AI development, constituting digital epistemic injustice. Our analysis reveals that English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge. We conclude with implications for decolonizing language technology and democratizing access to AI benefits.