The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities. Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture real-world language nuances. Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text. To address these challenges, we introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities. COMI-LINGUA is publicly available at: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA.