Despite advancements in conversational AI, language models still struggle to handle diverse conversational tasks, and existing dialogue dataset collections often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues, making it a rich and diverse resource for dialogue research and model training. To further enhance the utility of DialogStudio, we identify the license for each dataset and design domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning. Furthermore, we develop conversational AI models using the dataset collection, and our experiments in both zero-shot and few-shot learning scenarios demonstrate the superiority of DialogStudio. To improve transparency and support dataset-based and task-based research, as well as language model pre-training, all datasets, licenses, code, and models associated with DialogStudio are publicly available at https://github.com/salesforce/DialogStudio.