Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

Instruction tuning is a key technology in the development of large language models (LLMs): it aligns the models' responses with human expectations and thereby unlocks their impressive learning abilities. The two major approaches to instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), both of which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To make LLMs more accessible for research and development, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca and Vicuna. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, which limits their impact and accessibility for many other languages in the world. The few very recent works that explore instruction tuning for LLMs in multiple languages rely on SFT as the only approach. This leaves a significant gap for RLHF-based fine-tuned LLMs in diverse languages and raises important questions about how RLHF can boost the performance of multilingual instruction tuning. To fill this gap, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate experiments and the development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF over SFT for multilingual instruction tuning across different base models and datasets. Our framework and resources are released at https://github.com/nlp-uoregon/Okapi.
