Natural language to SQL (NL-to-SQL) systems have recently become significantly more accurate at translating natural language questions into SQL queries. This improvement is due to the emergence of transformer-based language models and the popularity of the Spider benchmark, the de facto standard for evaluating NL-to-SQL systems. The top NL-to-SQL systems reach accuracies of up to 85%. However, Spider mainly contains simple databases with few tables, columns, and entries, which does not reflect a realistic setting. Moreover, complex real-world databases with domain-specific content have little to no training data available in the form of NL/SQL pairs, which leads to poor performance of existing NL-to-SQL systems. In this paper, we introduce ScienceBenchmark, a new complex NL-to-SQL benchmark for three real-world, highly domain-specific databases. For this new benchmark, SQL experts and domain experts created high-quality NL/SQL pairs for each domain. To obtain more data, we extended the small amount of human-generated data with synthetic data generated using GPT-3. We show that our benchmark is highly challenging, as the top-performing systems on Spider achieve very low performance on it. Thus, the challenge is manifold: creating NL-to-SQL systems for highly complex domains with a small amount of hand-made training data augmented with synthetic data. To our knowledge, ScienceBenchmark is the first NL-to-SQL benchmark designed with complex real-world scientific databases, containing challenging training and test data carefully validated by domain experts.