Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O'Mahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, Sara Hooker

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in natural language processing (NLP) can be attributed to the fine-tuning of pre-trained models on a diverse set of tasks, which enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in English. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances across 114 languages, by templating and translating existing datasets. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.

