The cross-domain text-to-SQL task aims to build a system that can parse user questions into SQL on completely unseen databases, whereas the single-domain text-to-SQL task evaluates performance on the same databases used in training. Both setups face unavoidable difficulties in real-world applications. To this end, we introduce the cross-schema text-to-SQL task, in which the databases in the evaluation data differ from those in the training data but come from the same domain. Furthermore, we present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to support corresponding studies. CSS originally consisted of 4,340 question/SQL pairs across 2 databases. To generalize models to different medical systems, we extend CSS by creating 19 new databases along with 29,280 corresponding examples. Moreover, CSS is also a large corpus for single-domain Chinese text-to-SQL studies. We present the data collection approach and a series of analyses of the data statistics. To show the potential and usefulness of CSS, we run benchmark baselines and report their results. Our dataset is publicly available at \url{https://huggingface.co/datasets/zhanghanchong/css}.