NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation

In this paper, we propose a Chinese multi-turn topic-driven conversation dataset, NaturalConv, which allows the participants to chat about anything they want as long as some element of the topic is mentioned and the topic shift is smooth. Our corpus contains 19.9K conversations from six domains and 400K utterances, with an average of 20.1 turns per conversation. These conversations contain either in-depth discussions of related topics or wide-ranging, natural transitions between multiple topics. We believe either way is normal for human conversation. To facilitate research on this corpus, we provide results for several benchmark models. Comparative results show that, on this dataset, our current models do not improve significantly when background knowledge or topic information is introduced. Therefore, the proposed dataset should be a good benchmark for further research to evaluate the validity and naturalness of multi-turn conversation systems. Our dataset is available at https://ailab.tencent.com/ailab/nlp/dialogue/#datasets.
