We create publicly available language identification (LID) datasets and models for all 22 languages listed in the Indian constitution, covering both native-script and romanized text. First, we create Bhasha-Abhijnaanam, a language identification test set for native-script and romanized text that spans all 22 Indic languages. We also train IndicLID, a language identifier for all of these languages in both native and romanized script. For native-script text, IndicLID has better language coverage than existing LIDs and is competitive with or better than other LIDs. IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized text LID are the lack of training data and low LID performance when languages are similar. We provide simple and effective solutions to these problems. In general, there has been limited work on romanized text in any language, and our findings are relevant to other languages that need romanized language identification. Our models are publicly available at https://ai4bharat.iitm.ac.in/indiclid under open-source licenses. Our training and test sets are also publicly available at https://ai4bharat.iitm.ac.in/bhasha-abhijnaanam under open-source licenses.