Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools, such as OpenLID and GlotLID, often struggle to identify closely related languages and to distinguish valid natural language from noise. Both failures contaminate language-specific subsets, especially those for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging clusters of problematic language variants, and introducing a special label for marking noise. We call the extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available at https://huggingface.co/HPLT/OpenLID-v3.
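The trade-off between precision and coverage noted for ensembles can be illustrated with a minimal sketch. It assumes one simple kind of ensemble, in which a label is kept only when two classifiers agree and the document is otherwise left unlabelled; the abstract does not specify which ensemble strategies were evaluated, so this is an illustration rather than the method used in this work. The function names, the label codes, and the toy predictions are all hypothetical and are not taken from OpenLID-v3 or GlotLID.

```python
from typing import List, Optional, Sequence, Tuple


def agreement_ensemble(preds_a: Sequence[str], preds_b: Sequence[str]) -> List[Optional[str]]:
    """Keep a label only where both classifiers agree; otherwise abstain (None)."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return [a if a == b else None for a, b in zip(preds_a, preds_b)]


def precision_and_coverage(
    gold: Sequence[str], pred: Sequence[Optional[str]], lang: str
) -> Tuple[float, float]:
    """Return (precision, coverage) for one language.

    Precision: share of documents predicted as `lang` that truly are `lang`.
    Coverage:  share of true `lang` documents that were kept as `lang`.
    """
    predicted = sum(1 for p in pred if p == lang)
    correct = sum(1 for g, p in zip(gold, pred) if g == lang and p == lang)
    total = sum(1 for g in gold if g == lang)
    precision = correct / predicted if predicted else 0.0
    coverage = correct / total if total else 0.0
    return precision, coverage


# Hypothetical toy data: gold labels and the outputs of two imaginary classifiers.
gold = ["bos", "bos", "bos", "bos", "hrv", "hrv", "srp", "srp"]
model_a = ["bos", "bos", "bos", "hrv", "bos", "hrv", "srp", "srp"]
model_b = ["bos", "bos", "hrv", "bos", "hrv", "hrv", "bos", "srp"]

ensemble = agreement_ensemble(model_a, model_b)

for name, pred in [("model_a", model_a), ("model_b", model_b), ("ensemble", ensemble)]:
    p, c = precision_and_coverage(gold, pred, "bos")
    print(f"{name}: precision={p:.2f} coverage={c:.2f}")
```

On this toy data each classifier alone reaches a precision and coverage of 0.75 for `bos`, while the agreement ensemble reaches a precision of 1.0 at a coverage of 0.5: every document it keeps is correct, but half of the true `bos` documents are discarded. For a language with few documents to begin with, a loss of that size may outweigh the gain in precision.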