IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad G, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra

Despite considerable advances in English LLMs, progress in building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages and containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated, manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and use LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released under permissive licenses.

