Despite considerable advances in English LLMs, progress on building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages and containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated, manually verified data; unverified yet valuable data; and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and use LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then feeding these prompts to an aligned LLaMa2 model to obtain non-toxic responses. We hope that the datasets, tools, and resources released as part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released under permissive licenses.