Spoken dialogue is essential for human-AI interactions, providing expressive capabilities beyond text. Developing effective spoken dialogue systems (SDSs) requires large-scale, high-quality, and diverse spoken dialogue corpora. However, existing datasets are often limited in size, spontaneity, or linguistic coherence. To address these limitations, we introduce J-CHAT, a 76,000-hour open-source Japanese spoken dialogue corpus. Constructed using an automated, language-independent methodology, J-CHAT ensures acoustic cleanliness, diversity, and natural spontaneity. The corpus is built from YouTube and podcast data, with extensive filtering and denoising to enhance quality. Experimental results with generative spoken dialogue language models trained on J-CHAT demonstrate its effectiveness for SDS development. By providing a robust foundation for training advanced dialogue models, we anticipate that J-CHAT will drive progress in human-AI dialogue research and applications.