Multilingual large language models (LLMs) are expensive to pretrain and often suffer from imbalanced language and dataset coverage, English-centric bias, tokenizer over-segmentation of morphologically rich low-resource languages, and the curse of multilinguality. We introduce PARAMANU, the first family of Indian-only autoregressive language models trained from scratch on open-source, language-specific data for the five most widely spoken Indian languages: Bengali, Hindi, Marathi, Tamil, and Telugu. All models are designed for affordability: each is trained on a single GPU with a budget under $1,000, allowing under-resourced researchers to build competitive language models. To address low-resource challenges, we develop morphology-aligned, low-fertility tokenizers and propose an interpolation-based method for RoPE token position indices that allows efficient training on longer sequences. We also create instruction-tuning datasets in Bangla and translate them into the other four languages. Despite their small size (108M-367M parameters), the Paramanu models achieve a strong performance-efficiency tradeoff and outperform most larger multilingual models across all five languages. Our collection is available at https://huggingface.co/collections/mitodru/paramanu.
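To make the RoPE position-interpolation idea concrete, here is a minimal sketch of one common variant: linearly rescaling position indices so that a sequence longer than the pretraining context maps back into the trained position range. The function names and the linear-scaling rule are illustrative assumptions for exposition, not the paper's exact formulation.

```python
def rope_angles(position, dim, base=10000.0):
    # Rotary-embedding rotation angles for a single (possibly fractional)
    # position index, using the standard RoPE frequency schedule.
    return [position / base ** (2 * (i // 2) / dim) for i in range(dim)]

def interpolated_positions(seq_len, trained_len):
    # Linearly interpolate integer position indices 0..seq_len-1 so that
    # a longer sequence fits inside the position range [0, trained_len)
    # seen during pretraining; shorter sequences are left unchanged.
    scale = trained_len / seq_len if seq_len > trained_len else 1.0
    return [p * scale for p in range(seq_len)]

# A sequence twice the trained length is squeezed into the trained range.
positions = interpolated_positions(seq_len=8, trained_len=4)
# Every interpolated index stays strictly below the trained context length,
# so the model only sees rotation angles it was pretrained on.
```

Because the angles vary smoothly with the position index, fractional positions produced by interpolation remain well defined, which is what makes this cheaper than retraining position embeddings from scratch.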