Despite considerable advances in English LLMs, progress in building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages and containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated, manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and use LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and WikiHow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released under permissive licenses.