Language identification is a crucial foundational step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder for the diverse Indian languages, which exhibit lexical and phonetic similarities while remaining distinct. Many Indian languages also share the same script, making the task even more challenging. Taking these challenges into account, we develop and release a dataset of 250K sentences covering 23 languages, English and all 22 official Indian languages, each labeled with its language identifier, where the data for most languages are newly created. We also develop and release baseline models using state-of-the-art machine learning approaches and fine-tuned pre-trained transformer models. Our models outperform the state-of-the-art pre-trained transformer models on the language identification task. The dataset and code are available at https://yashingle-ai.github.io/ILID/ and through the Hugging Face open-source libraries.