Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools, such as OpenLID and GlotLID, often struggle to identify closely related languages and to distinguish valid natural language from noise. The resulting errors contaminate language-specific subsets, especially for low-resource languages. In this work, we extend the OpenLID classifier by adding more training data, merging problematic clusters of language variants, and introducing a special label for marking noise. We call the extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available at https://huggingface.co/HPLT/OpenLID-v3.
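The precision and coverage trade-off of ensembling mentioned above can be illustrated with a minimal sketch. It assumes one simple ensembling rule, keeping a label only when all classifiers agree, which is not necessarily the rule used with OpenLID-v3 and GlotLID. The names `ensemble_label`, `coverage`, `toy_a`, and `toy_b` are hypothetical, and the two toy classifiers are keyword stand-ins for real LID models with no linguistic validity.

```python
from typing import Callable, Optional, Sequence

# A classifier maps a text to a single language label (hypothetical interface).
Classifier = Callable[[str], str]


def ensemble_label(text: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return the shared label if all classifiers agree, otherwise None."""
    labels = {clf(text) for clf in classifiers}
    return labels.pop() if len(labels) == 1 else None


def coverage(texts: Sequence[str], classifiers: Sequence[Classifier]) -> float:
    """Fraction of texts that receive a label from the agreement ensemble."""
    if not texts:
        return 0.0
    kept = sum(ensemble_label(t, classifiers) is not None for t in texts)
    return kept / len(texts)


# Toy stand-ins for two LID systems (assumption: keyword rules for illustration only).
def toy_a(text: str) -> str:
    return "hrv_Latn" if "tko" in text else "srp_Latn"


def toy_b(text: str) -> str:
    return "hrv_Latn" if ("tko" in text or "što" in text) else "srp_Latn"


samples = ["tko je to", "što je to", "ko je to"]
decisions = [ensemble_label(s, [toy_a, toy_b]) for s in samples]
share_kept = coverage(samples, [toy_a, toy_b])
```

In this toy setting, the second sample is discarded because the two classifiers disagree. Labels that survive are more likely to be correct, but fewer documents are retained, and the loss falls hardest on languages where the systems disagree most often.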