Urdu, spoken by 230 million people worldwide, lacks dedicated transformer-based language models and curated corpora. While multilingual models provide limited Urdu support, they suffer from poor performance, high computational costs, and cultural inaccuracies due to insufficient training data. To address these challenges, we present UrduLM, a monolingual Urdu language model pretrained in a low-resource setting. We curate a 33GB Urdu corpus from diverse sources, develop a custom BPE tokenizer that reduces tokenization overhead by at least 20-30% compared to multilingual alternatives, and pretrain a 100M-parameter decoder-only model. In few-shot evaluations, UrduLM achieves performance competitive with multilingual models up to 30x its size, reaching 66.6% accuracy on sentiment classification and BLEU scores exceeding 30 on grammar correction tasks. All artifacts, including the corpus, tokenizer, model weights, and evaluation benchmarks, are released openly to establish a baseline for Urdu NLP research and to provide a scalable framework for other underrepresented languages.
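The tokenization-overhead comparison referenced above can be illustrated with a short sketch. The snippet below is not the paper's actual training pipeline; it is a minimal example, assuming the HuggingFace `tokenizers` and `transformers` libraries, a hypothetical corpus file `urdu_corpus.txt`, an assumed 32k vocabulary, and mBERT's tokenizer as the multilingual baseline. Overhead is measured as the token count produced for the same Urdu text.

```python
# Illustrative sketch: train a byte-level BPE tokenizer on Urdu text and
# compare its token count against a multilingual tokenizer on the same input.
# File path, vocabulary size, and the sample sentence are hypothetical.
from tokenizers import ByteLevelBPETokenizer
from transformers import AutoTokenizer

# Train a custom BPE tokenizer on raw Urdu text (path is a placeholder).
urdu_tokenizer = ByteLevelBPETokenizer()
urdu_tokenizer.train(
    files=["urdu_corpus.txt"],   # hypothetical corpus shard
    vocab_size=32_000,           # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
)

# Multilingual baseline for comparison (mBERT's WordPiece tokenizer).
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sample = "یہ اردو زبان کا ایک جملہ ہے۔"  # "This is a sentence in the Urdu language."

n_custom = len(urdu_tokenizer.encode(sample).ids)
n_multi = len(multilingual.tokenize(sample))

# Fewer tokens for the same text means lower tokenization overhead:
# shorter sequences, less compute per example, and more effective context.
print(f"custom BPE: {n_custom} tokens, multilingual: {n_multi} tokens")
print(f"relative reduction: {1 - n_custom / n_multi:.1%}")
```

Averaging this ratio over a held-out corpus, rather than a single sentence, is the standard way to report such a reduction.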