Impressive progress has been made on chat models based on Large Language Models (LLMs) recently; however, there is a noticeable lag in multi-turn conversations between open-source chat models (e.g., Alpaca and Vicuna) and the leading chat models (e.g., ChatGPT and GPT-4). Through a series of analyses, we attribute the lag to the lack of enough high-quality multi-turn instruction-tuning data. The available instruction-tuning data for the community are either single-turn conversations or multi-turn ones with certain issues, such as non-human-like instructions, less detailed responses, or rare topic shifts. In this paper, we address these challenges by introducing Parrot, a highly scalable solution designed to automatically generate high-quality instruction-tuning data, which are then used to enhance the effectiveness of chat models in multi-turn conversations. Specifically, we start by training the Parrot-Ask model, which is designed to emulate real users in generating instructions. We then utilize Parrot-Ask to engage in multi-turn conversations with ChatGPT across a diverse range of topics, resulting in a collection of 40K high-quality multi-turn dialogues (Parrot-40K). These data are subsequently employed to train a chat model that we have named Parrot-Chat. We demonstrate that the dialogues gathered from Parrot-Ask markedly outperform existing multi-turn instruction-following datasets in critical metrics, including topic diversity, number of turns, and resemblance to human conversation. With only 40K training examples, Parrot-Chat achieves strong performance against other 13B open-source models across a range of instruction-following benchmarks, and particularly excels in evaluations of multi-turn capabilities. We make all codes, datasets, and two versions of the Parrot-Ask model based on LLaMA2-13B and KuaiYii-13B available at https://github.com/kwai/KwaiYii/Parrot.
翻译:摘要:基于大型语言模型的对话模型近期取得了显著进展;然而,开源对话模型(如Alpaca和Vicuna)与领先对话模型(如ChatGPT和GPT-4)在多轮对话方面仍存在明显差距。通过一系列分析,我们将这一差距归因于缺乏足够高质量的多轮指令微调数据。目前社区可用的指令微调数据要么是单轮对话,要么存在某些缺陷(如指令不够拟人化、回复不够详细或话题转换频率过低)。本文通过引入Parrot(一种高度可扩展的解决方案)来应对这些挑战,该方案可自动生成高质量的指令微调数据,进而用于提升对话模型在多轮对话中的有效性。具体而言,我们首先训练Parrot-Ask模型,该模型旨在模拟真实用户生成指令。随后,我们利用Parrot-Ask模型与ChatGPT在多样化主题上进行多轮对话,最终收集到40K个高质量多轮对话(Parrot-40K)。这些数据被用于训练我们命名为Parrot-Chat的对话模型。实验表明,从Parrot-Ask采集的对话在关键指标(包括主题多样性、对话轮次数量及与人类对话的相似度)上显著优于现有的多轮指令遵循数据集。仅使用40K训练样本,Parrot-Chat在多个指令遵循基准测试中即可与13B开源模型相媲美,尤其在多轮能力评估中表现突出。我们已将全部代码、数据集及基于LLaMA2-13B和KuaiYii-13B的两个Parrot-Ask模型版本开源至https://github.com/kwai/KwaiYii/Parrot。