In this paper, we propose NaturalConv, a Chinese multi-turn, topic-driven conversation dataset in which participants may chat about anything they want, as long as some element of the topic is mentioned and topic shifts are smooth. The corpus contains 19.9K conversations from six domains and 400K utterances, with an average of 20.1 turns per conversation. The conversations contain either in-depth discussion of related topics or broad, natural transitions between multiple topics; we believe both are normal in human conversation. To facilitate research on this corpus, we provide results for several benchmark models. Comparative results show that, on this dataset, our current models do not achieve significant improvement from introducing background knowledge or topic information. The proposed dataset should therefore be a good benchmark for further research on evaluating the validity and naturalness of multi-turn conversation systems. Our dataset is available at https://ailab.tencent.com/ailab/nlp/dialogue/#datasets.