INDIC-DIALECT: A Multi-Task Benchmark to Evaluate and Translate in Indian Language Dialects

Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served, especially in the Indian context. The issue is particularly acute in India: although Hindi is the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets cover standard Hindi and Odia, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs such as GPT-4o and Gemini 2.5 perform poorly on the classification task, while fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., raising F1 from 19.6\% to 89.8\% on dialect classification. For dialect-to-language translation, a hybrid AI model achieves the highest BLEU score of 61.32, compared to the baseline score of 23.36. Interestingly, owing to the complexity of generating dialect sentences, for language-to-dialect translation the ``rule-based followed by AI'' approach achieves the best BLEU score of 48.44, compared to the baseline score of 27.59. INDIC-DIALECT is thus a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.
