Despite advances in conversational AI, language models still struggle to handle diverse conversational tasks, and existing collections of dialogue datasets often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues, making it a rich and diverse resource for dialogue research and model training. To further enhance the utility of DialogStudio, we identify the license of each dataset and design external-knowledge and domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning. Furthermore, we develop conversational AI models using the dataset collection, and our experiments in both zero-shot and few-shot learning scenarios demonstrate the superiority of DialogStudio. To improve transparency and support dataset-based and task-based research, as well as language model pre-training, all datasets, licenses, code, and models associated with DialogStudio are made publicly accessible\footnote{\url{https://github.com/salesforce/DialogStudio}}.