Despite considerable advances in English LLMs, progress on building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages and containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated, manually verified data; unverified yet valuable data; and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and use LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then feeding these prompts to an aligned LLaMa2 model to obtain non-toxic responses. We hope that the datasets, tools, and resources released as part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released under permissive licenses.