Text-to-SQL parsing, which aims to convert natural language instructions into executable SQL queries, has gained increasing attention in recent years. In particular, Codex and ChatGPT have shown impressive results on this task. However, most prevalent benchmarks, e.g., Spider and WikiSQL, focus on database schemas with only a few rows of database content, leaving a gap between academic study and real-world applications. To mitigate this gap, we present BIRD, a big benchmark for large-scale database grounded text-to-SQL tasks, containing 12,751 text-to-SQL pairs and 95 databases with a total size of 33.4 GB, spanning 37 professional domains. Our emphasis on database values highlights new challenges: dirty database contents, the external knowledge needed to link NL questions to database contents, and SQL efficiency, particularly in the context of massive databases. To solve these problems, text-to-SQL models must be capable of database value comprehension in addition to semantic parsing. The experimental results demonstrate the importance of database values in generating accurate SQL for big databases. Furthermore, even the most effective text-to-SQL model, ChatGPT, achieves only 40.08% execution accuracy, far below the human performance of 92.96%, showing that substantial challenges remain. We also provide an efficiency analysis that offers insights into text-to-efficient-SQL generation, which benefits industrial applications. We believe that BIRD will contribute to advancing real-world applications of text-to-SQL research. The leaderboard and source code are available at https://bird-bench.github.io/.
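The results above are reported in execution accuracy. As a minimal sketch, assuming execution accuracy is computed by running the predicted and gold SQL against the same database and comparing their result rows as unordered sets (with a prediction that fails to execute counted as wrong), the metric could be expressed in Python using the standard-library sqlite3 module. The function names `execution_match` and `execution_accuracy`, the toy `schools` table, and the example queries are hypothetical illustrations; this is not the official BIRD evaluation script.

```python
import sqlite3
from typing import Iterable, Tuple


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if both queries return the same set of rows (order ignored)."""
    try:
        predicted_rows = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        # Assumption: a prediction that raises an SQL error counts as incorrect.
        return False
    gold_rows = set(conn.execute(gold_sql).fetchall())
    return predicted_rows == gold_rows


def execution_accuracy(conn: sqlite3.Connection, pairs: Iterable[Tuple[str, str]]) -> float:
    """Percentage of (predicted, gold) pairs whose execution results match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy database standing in for a real benchmark database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schools (name TEXT, city TEXT, enrollment INTEGER)")
    conn.executemany(
        "INSERT INTO schools VALUES (?, ?, ?)",
        [("A", "Fresno", 500), ("B", "Oakland", 1200), ("C", "Fresno", 800)],
    )
    pairs = [
        # Same rows, different order: counted as a match.
        ("SELECT name FROM schools WHERE city = 'Fresno'",
         "SELECT name FROM schools WHERE city = 'Fresno' ORDER BY name"),
        # Different SQL text, same result: counted as a match.
        ("SELECT COUNT(*) FROM schools", "SELECT COUNT(name) FROM schools"),
        # Wrong filter value: different rows, not a match.
        ("SELECT name FROM schools WHERE enrollment > 1000",
         "SELECT name FROM schools WHERE enrollment > 600"),
        # Misspelled column: execution error, not a match.
        ("SELECT nme FROM schools", "SELECT name FROM schools"),
    ]
    print(execution_accuracy(conn, pairs))  # prints 50.0
```

Comparing results as sets rewards queries that are written differently but return the same answer, which is the point of an execution-based metric; the trade-off is that this sketch ignores row order and duplicate rows, which a stricter evaluation might need to check.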