Popular data-to-text datasets suffer from three problems. First, the large-scale datasets either contain noise or lack real application scenarios. Second, the datasets that are close to real applications are relatively small. Third, current datasets are biased toward English, leaving other languages underexplored. To alleviate these limitations, we present CATS, a pragmatic, large-scale, high-quality Chinese answer-to-sequence dataset. The task is to generate textual descriptions of the answers returned by a practical TableQA system. Further, to bridge the structural gap between the input SQL and table and to establish better semantic alignments, we propose a Unified Graph Transformation approach that establishes a joint encoding space for the two hybrid knowledge resources and converts the task into a graph-to-text problem. Experimental results demonstrate the effectiveness of the proposed method. Further analysis on CATS attests to both the high quality of the dataset and the challenges it poses.