Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in natural language processing (NLP) can be attributed to fine-tuning pre-trained models on a diverse set of tasks, which enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, almost all existing datasets are in English. In this work, our primary goal is to bridge this language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances across 114 languages, by templating and translating existing datasets. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.