Language Identification (LID) is the task of determining the language of a given text and is a fundamental preprocessing step that affects the reliability of downstream NLP applications. While recent work has expanded LID coverage for African languages, existing approaches remain limited in (i) the number of supported languages and (ii) their ability to make fine-grained distinctions among closely related varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 713 African languages, and AfroScope-Models, a suite of strong LID models with broad language coverage. To better distinguish highly confusable languages, we propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. This approach improves macro F1 by 4.55 on this confusable subset compared to our best base model. Finally, we analyze cross-linguistic transfer and domain effects, offering guidance for building robust African LID systems. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text and release AfroScope-Data and AfroScope-Models publicly.
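The hierarchical idea can be illustrated with a minimal sketch: a base classifier labels every input, and only predictions that fall in the confusable subset are passed to a second stage that compares an embedding of the text against per-language reference vectors. The abstract does not specify how Mirror-Serengeti embeddings are used for the final decision, so the nearest-centroid rule with cosine similarity is an assumption made here for illustration. The names `base_predict`, `embed`, `centroids`, and `confusable` are hypothetical placeholders, not part of any released AfroScope API. The `macro_f1` helper shows the standard unweighted mean of per-language F1 scores, the metric reported above.

```python
import math
from typing import Callable, Dict, List, Sequence, Set

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hierarchical_lid(
    text: str,
    base_predict: Callable[[str], str],   # hypothetical stage-1 LID model
    embed: Callable[[str], Vector],       # hypothetical specialized embedder
    centroids: Dict[str, Vector],         # one reference vector per confusable language
    confusable: Set[str],                 # labels routed to stage 2
) -> str:
    """Two-stage routing: keep the base label unless it is in the confusable set."""
    label = base_predict(text)
    if label not in confusable:
        return label
    # Assumed stage-2 rule: pick the confusable language with the closest centroid.
    vec = embed(text)
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy stand-ins (purely illustrative, not real models or real languages).
def toy_base_predict(text: str) -> str:
    return "lang_c" if text.startswith("c") else "lang_a"


def toy_embed(text: str) -> Vector:
    return [float(text.count("x")), float(text.count("y"))]


TOY_CENTROIDS = {"lang_a": [1.0, 0.0], "lang_b": [0.0, 1.0]}
TOY_CONFUSABLE = {"lang_a", "lang_b"}
```

In this toy setup, an input that the base model labels `lang_c` is returned unchanged, while an input labelled `lang_a` is re-scored against the `lang_a` and `lang_b` centroids and may be reassigned. Restricting the second stage to the confusable subset keeps the specialized model from overriding base predictions on languages it was not built for.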