Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging. Instruction tuning, i.e. tuning models on instructions and sample responses generated by humans (Ouyang et al., 2022), has proven to be an effective method to do so, yet it requires a number of data samples that a) might not be available or b) are costly to generate. Furthermore, this cost increases when the goal is to make the LLM follow a specific workflow within a dialogue instead of single instructions. Inspired by the self-play technique in reinforcement learning and the use of LLMs to simulate human agents, we propose a more effective method for data collection through LLMs engaging in a conversation in various roles. This approach generates training data via "self-talk" of LLMs that can be refined and utilized for supervised fine-tuning. We introduce an automated way to measure the (partial) success of a dialogue. This metric is used to filter the generated conversational data that is fed back into the LLM for training. Based on our automated and human evaluations of conversation quality, we demonstrate that such self-talk data improves results. In addition, we examine the various characteristics that showcase the quality of generated dialogues and how they can be connected to their potential utility as training data.
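The generate-then-filter loop summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify how the success metric is computed, so `partial_success` uses simple keyword matching against a hypothetical workflow, and `scripted` stands in for real LLM calls. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# A speaker is any function mapping the dialogue history to the next utterance.
# In practice this would wrap an LLM prompted with a role; here it is a stub.
Speaker = Callable[[List[str]], str]


@dataclass
class Dialogue:
    turns: List[str]
    score: float = 0.0


def scripted(lines: List[str]) -> Speaker:
    """Hypothetical stand-in for an LLM: replays fixed lines in order."""
    def speak(history: List[str]) -> str:
        return lines[(len(history) // 2) % len(lines)]
    return speak


def self_talk(agent: Speaker, client: Speaker, max_turns: int = 6) -> Dialogue:
    """Alternate between the two roles (agent speaks first) for max_turns turns."""
    turns: List[str] = []
    for i in range(max_turns):
        speaker = agent if i % 2 == 0 else client
        turns.append(speaker(turns))
    return Dialogue(turns)


def partial_success(dialogue: Dialogue, workflow: List[str]) -> float:
    """Assumed metric: fraction of workflow steps found in the agent's turns."""
    agent_text = " ".join(dialogue.turns[0::2]).lower()
    done = sum(step.lower() in agent_text for step in workflow)
    return done / len(workflow)


def filter_dialogues(dialogues: List[Dialogue], workflow: List[str],
                     threshold: float = 0.5) -> List[Dialogue]:
    """Score every dialogue and keep those at or above the threshold."""
    kept = []
    for d in dialogues:
        d.score = partial_success(d, workflow)
        if d.score >= threshold:
            kept.append(d)
    return kept


# Hypothetical workflow and two scripted agents of differing quality.
workflow = ["greet", "ask name", "confirm order"]
client = scripted(["hi", "I am Sam", "yes"])
good = self_talk(scripted(["I greet you, welcome!",
                           "May I ask name please?",
                           "Let me confirm order now."]), client)
bad = self_talk(scripted(["I greet you.", "Nice weather.", "Bye."]), client)

training_data = filter_dialogues([good, bad], workflow)
```

In this toy run only the dialogue whose agent covers the workflow passes the filter; the retained dialogues would then serve as supervised fine-tuning data.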