A practical text-to-SQL system should generalize well to a wide variety of natural language questions, unseen database schemas, and novel SQL query structures. To comprehensively evaluate text-to-SQL systems, we introduce a UNIfied benchmark for Text-to-SQL Evaluation (UNITE). It is composed of publicly available text-to-SQL datasets and contains natural language questions from more than 12 domains, SQL queries from more than 3.9K patterns, and 29K databases. Compared to the widely used Spider benchmark, we introduce $\sim$120K additional examples and a threefold increase in SQL patterns, including patterns for comparative and boolean questions. We conduct a systematic study of six state-of-the-art (SOTA) text-to-SQL parsers on our new benchmark and show that: 1) Codex performs surprisingly well on out-of-domain datasets; 2) specially designed decoding methods (e.g., constrained beam search) can improve performance in both in-domain and out-of-domain settings; 3) explicitly modeling the relationship between questions and schemas further improves Seq2Seq models. More importantly, our benchmark presents key challenges in compositional generalization and robustness, which these SOTA models cannot address well. Our code and data processing script are available at https://github.com/awslabs/unified-text2sql-benchmark