We present Gyan AI Paramanu ("atom"), a family of novel language models for Indian languages. It is a collection of auto-regressive monolingual, bilingual, and multilingual Indic language models pretrained from scratch on a single GPU for 10 Indian languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts (Bangla, Devanagari, Odia, Tamil, Telugu), with model sizes ranging from 13.29M to 367.5M parameters. The models are pretrained with a context size of 1024. They are efficient, small, fast, and powerful. We have also developed an efficient, advanced Indic tokenizer that can tokenize even unseen languages. To avoid the "curse of multi-linguality" in our multilingual mParamanu model, we pretrained on comparable corpora by typological grouping using the same script. We performed human evaluation of our pretrained models for open-ended text generation on grammar, coherence, creativity, and factuality metrics for Bangla, Hindi, and Sanskrit. Our Bangla, Hindi, and Sanskrit models outperformed GPT-3.5-Turbo (ChatGPT), Bloom 7B, LLaMa-2 7B, OPT 6.7B, GPT-J 6B, GPTNeo 1.3B, and GPT2-XL large language models (LLMs) by a large margin despite being 20 to 66 times smaller than standard 7B LLMs. Inference on our pretrained models runs on a CPU; no GPU is needed. We also instruction-tuned our pretrained Bangla, Hindi, Marathi, Tamil, and Telugu models on 23k instructions in the respective languages. Our pretrained and instruction-tuned models are the first of their kind and the most powerful and efficient small generative language models developed for Indic languages to date, and our results lead to the conclusion that high-quality generative language models are possible without large amounts of compute or a huge number of parameters. We plan to release our models at https://www.bharatgpts.com.