High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas: aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user and maintain their engagement. In this paper, we leverage Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic framework that expands the initial dataset while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework along different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat in a Turing test decreases from 17.2% to 8.8% over three iterations.
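The Generator-Critic loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `critic_select`, `expand_dataset`, `toy_generator`, and `length_expert` are hypothetical, the LLM calls are replaced by deterministic stand-in functions, expert scores are simply averaged, and "improving the Generator" is assumed here to mean feeding the selected conversations back as examples for the next round.

```python
from typing import Callable, List

Conversation = str
# Hypothetical interfaces: a generator maps (examples, n) to n candidates,
# and each expert maps a conversation to a quality score.
Generator = Callable[[List[Conversation], int], List[Conversation]]
Expert = Callable[[Conversation], float]


def critic_select(candidates: List[Conversation],
                  experts: List[Expert],
                  keep: int) -> List[Conversation]:
    """Average the expert scores per candidate and keep the top `keep`."""
    def mean_score(conv: Conversation) -> float:
        return sum(expert(conv) for expert in experts) / len(experts)

    return sorted(candidates, key=mean_score, reverse=True)[:keep]


def expand_dataset(seed: List[Conversation],
                   generator: Generator,
                   experts: List[Expert],
                   iterations: int = 3,
                   n_candidates: int = 4,
                   keep: int = 2) -> List[Conversation]:
    """Grow the seed dataset over several Generator-Critic rounds."""
    dataset = list(seed)
    examples = list(seed)  # what the generator is conditioned on
    for _ in range(iterations):
        candidates = generator(examples, n_candidates)
        best = critic_select(candidates, experts, keep)
        dataset.extend(best)
        # Assumption: selected conversations become the next round's examples.
        examples = best
    return dataset


def toy_generator(examples: List[Conversation], n: int) -> List[Conversation]:
    """Stand-in for an LLM: extends existing examples deterministically."""
    if not examples:
        return []
    return [f"{examples[i % len(examples)]} | turn{i}" for i in range(n)]


def length_expert(conv: Conversation) -> float:
    """Stand-in for an expert LLM: longer conversations score higher."""
    return float(len(conv))
```

Calling `expand_dataset(["A: hi"], toy_generator, [length_expert])` runs three rounds and appends the two best candidates from each round to the seed. In a real setting, the stand-in functions would be replaced by prompted LLM calls, and the aggregation rule across experts would be a design decision of its own.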