Natural Language Understanding (NLU) for low-resource languages remains a major challenge in NLP due to the scarcity of high-quality data and language-specific models. Maithili, despite being spoken by millions, lacks adequate computational resources, limiting its inclusion in digital and AI-driven applications. To address this gap, we introduce maiBERT, a BERT-based language model pre-trained specifically for Maithili with the Masked Language Modeling (MLM) objective. The model is trained on a newly constructed Maithili corpus and evaluated on a news classification task. In our experiments, maiBERT achieved 87.02% accuracy, outperforming existing regional models such as NepBERTa and HindiBERT, with a 0.13% overall accuracy gain and 5-7% improvements across individual classes. We have open-sourced maiBERT on Hugging Face, enabling further fine-tuning for downstream tasks such as sentiment analysis and Named Entity Recognition (NER).
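To make the pre-training objective concrete, the following is a minimal, self-contained sketch of BERT-style MLM masking (the standard 15% selection with an 80/10/10 split into mask / random / unchanged tokens). This is an illustration of the general technique only, not the authors' actual training pipeline; the function name and toy token list are hypothetical.

```python
import random

def mlm_mask(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Illustrative BERT-style MLM masking.

    Roughly mask_prob of the tokens are selected for prediction; of
    those, 80% are replaced with [MASK], 10% with a random vocabulary
    token, and 10% are left unchanged. `labels` stores the original
    token at selected positions and None elsewhere (excluded from loss).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility in this sketch
    vocab = sorted(set(tokens))  # toy vocabulary drawn from the input itself
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)  # kept as-is but still predicted
        else:
            labels.append(None)  # position does not contribute to the loss
            inputs.append(tok)
    return inputs, labels
```

In practice this corruption is applied dynamically per batch (e.g. via a data collator) so the model sees different maskings of the same sentence across epochs.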