Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Fine-tuning on instruction data has been widely validated as an effective practice for implementing chat language models like ChatGPT. Scaling the diversity and quality of such data, although straightforward, stands a great chance of leading to improved performance. This paper aims to improve the upper bound of open-source models further. We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat, which does not involve human queries. Our objective is to capture the breadth of interactions that a human might have with an AI assistant, and we employ a comprehensive framework to generate multi-turn conversations iteratively. UltraChat contains 1.5 million high-quality multi-turn dialogues and covers a wide range of topics and instructions. Our statistical analysis of UltraChat reveals its superiority in various key metrics, including scale, average length, diversity, and coherence, solidifying its position as a leading open-source dataset. Building upon UltraChat, we fine-tune a LLaMA model to create a powerful conversational model, UltraLLaMA. Our evaluations indicate that UltraLLaMA consistently outperforms other open-source models, including Vicuna, the previously recognized state-of-the-art open-source model. The dataset and the model will be publicly released at https://github.com/thunlp/UltraChat.
