Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training-data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle for low-resource languages and closely related language varieties. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter-estimation technique, and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and integrates naturally into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings (surpassing 70% accuracy with as few as five labeled samples per language), and delivers large gains on fine-grained dialect identification.
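The core idea (language-conditional unigram distributions over a shared vocabulary, with segmentation resolved per language at inference time) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names are hypothetical, parameters are estimated here by smoothed counts over a greedy segmentation rather than the EM procedure UnigramLM uses, and inference scores each language by a Viterbi search for its best-scoring segmentation.

```python
import math
from collections import defaultdict

def viterbi_score(text, logprob, max_piece_len=8):
    """Max log-likelihood over all segmentations of `text` into
    vocabulary pieces, via the standard Viterbi dynamic program."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_piece_len), i):
            piece = text[j:i]
            if piece in logprob and best[j] > -math.inf:
                best[i] = max(best[i], best[j] + logprob[piece])
    return best[n]

def train_unigram(samples, vocab, alpha=0.5):
    """Language-conditional unigram log-probabilities over the shared
    vocabulary `vocab`, from additively smoothed piece counts.
    (Illustrative stand-in for UnigramLM-style EM estimation.)"""
    counts = defaultdict(float)
    for text in samples:
        i = 0
        while i < len(text):
            # Greedy longest-match segmentation, for simplicity.
            for length in range(min(8, len(text) - i), 0, -1):
                if text[i:i + length] in vocab or length == 1:
                    counts[text[i:i + length]] += 1.0
                    i += length
                    break
    total = sum(counts.values()) + alpha * len(vocab)
    return {p: math.log((counts.get(p, 0.0) + alpha) / total) for p in vocab}

def identify(text, models):
    """Predict the language whose unigram model assigns `text` the
    highest best-segmentation log-likelihood."""
    return max(models, key=lambda lang: viterbi_score(text, models[lang]))
```

Because each language only contributes its own probability table over the shared vocabulary, adding a new language means training one more table and including it in `models`, without touching the existing ones, which is the incremental-extension property the abstract highlights.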