Spurred by recent advances in Large Language Models (LLMs), virtual assistants are poised to take a leap forward in their dialogue capabilities. Yet a major bottleneck to achieving genuinely transformative task-oriented dialogue capabilities remains the scarcity of high-quality data. Existing datasets, while impressive in scale, have limited domain coverage and contain few genuinely challenging conversational phenomena; those that are present are typically unlabelled, making it difficult to assess the strengths and weaknesses of models without time-consuming and costly human evaluation. Moreover, creating high-quality dialogue data has until now required considerable human input, limiting both the scale of these datasets and the ability to rapidly bootstrap data for a new target domain. We aim to overcome these issues with LUCID, a modularised and highly automated LLM-driven data generation system that produces realistic, diverse and challenging dialogues. We use LUCID to generate a seed dataset of 4,277 conversations across 100 intents to demonstrate its capabilities, with a human review finding consistently high-quality labels in the generated data.