Language Identification (LID), the task of determining the language of a given text, is a fundamental preprocessing step that shapes the reliability of downstream NLP applications. While recent work has expanded African LID, existing systems remain limited in both language coverage and fine-grained discrimination among closely related languages and varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 640 languages, and AfroScope-Models, a suite of strong LID models with broad African language coverage. To address persistent confusions among closely related languages, we propose a hierarchical classification approach that leverages AfroScope-Mirror, a specialized embedding model for targeted disambiguation, improving macro-F1 by 1.57 points on the confusable subset compared to our best base model. We further analyze cross-lingual transfer and domain effects, showing how language-family structure, script compatibility, and domain coverage shape LID performance. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text, and release AfroScope-Data and AfroScope-Models online.
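The hierarchical approach described above can be illustrated with a minimal two-stage sketch. This is not the AfroScope implementation: the function names, the `confusable_groups` structure, the per-language centroids, and the character-trigram embedding are all assumptions made for illustration, with the toy embedding standing in for a specialized model such as AfroScope-Mirror. A base classifier predicts a label, and only when that label belongs to a known group of confusable languages is the decision handed to an embedding-based comparison restricted to that group.

```python
from collections import Counter
from math import sqrt


def trigram_embed(text):
    """Toy stand-in for an embedding model: character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hierarchical_lid(text, base_predict, confusable_groups, centroids, embed=trigram_embed):
    """Two-stage LID (hypothetical): base label first, then targeted disambiguation.

    base_predict      -- callable mapping text to a language label (stage 1)
    confusable_groups -- list of sets of labels the base model tends to confuse
    centroids         -- dict mapping each grouped label to a reference embedding
    """
    label = base_predict(text)
    group = next((g for g in confusable_groups if label in g), None)
    if group is None:
        # Stage 1 is trusted for languages outside every confusable group.
        return label
    vec = embed(text)
    # Stage 2: choose the group member whose centroid is most similar.
    # sorted() makes tie-breaking deterministic.
    return max(sorted(group), key=lambda lang: cosine(vec, centroids[lang]))


# Usage with placeholder labels and synthetic reference text.
groups = [{"lang_a", "lang_b"}]
centroids = {
    "lang_a": trigram_embed("aaaa aaaa"),
    "lang_b": trigram_embed("bbbb bbbb"),
}
print(hierarchical_lid("bbbb", lambda t: "lang_a", groups, centroids))
```

Routing only the confusable subset to the second stage leaves base-model predictions for all other languages unchanged, so any gain or loss from the specialized model is confined to the groups where confusion is known to occur.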