The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

We introduce the StatCan Dialogue Dataset, consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5,000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on an ongoing conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations: we observe a significant drop in performance across both tasks when moving from the validation set to the test set. In addition, we find that response generation models struggle to decide when to return a table. Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for these tasks, which can be directly used to help knowledge workers find relevant tables for live chat users.
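To make task (1) concrete, the following is a minimal sketch of conversation-to-table retrieval framed as ranking: the conversation so far is treated as a query and each table title as a document. It is an illustrative lexical scorer, not one of the baselines evaluated on the dataset; the function names, table IDs, titles, and conversation turns are all made up for the example.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase and keep runs of letters (including accented French letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def rank_tables(conversation_turns, table_titles):
    """Rank table IDs by IDF-weighted token overlap with the conversation so far.

    conversation_turns: list of utterance strings (hypothetical input format).
    table_titles: dict mapping a table ID to its title (hypothetical format).
    """
    docs = {tid: Counter(tokenize(title)) for tid, title in table_titles.items()}
    n_docs = len(docs)
    # Document frequency: in how many table titles each token appears.
    df = Counter(tok for counts in docs.values() for tok in counts)
    query = Counter(tokenize(" ".join(conversation_turns)))
    scores = {}
    for tid, counts in docs.items():
        scores[tid] = sum(
            query[tok] * counts[tok] * math.log(1 + n_docs / df[tok])
            for tok in counts
            if tok in query
        )
    # Highest score first; ties keep the input order of table_titles.
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for illustration only.
tables = {
    "t1": "Consumer Price Index, monthly",
    "t2": "Labour force characteristics by province",
    "t3": "Population estimates, quarterly",
}
turns = [
    "Hello, I am looking for unemployment data.",
    "Do you want it by province?",
    "Yes, labour force by province please.",
]
ranking = rank_tables(turns, tables)
print(ranking[0])
```

Under a temporal split, a scorer like this would be tuned on earlier conversations and evaluated on later ones, which is the setting in which the abstract reports a performance drop.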
