Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art in Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, comparing the execution results of a generated SQL query and a human-labeled ground-truth query on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while being semantically different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, in which a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based evaluation often overlooks differences between the generated query and the ground truth. Further analysis of the verification results reveals a more complex picture of current Text-to-SQL evaluation.
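To illustrate the failure mode that motivates SpotIt, here is a minimal sketch (not taken from the paper; the `employees` schema, the two queries, and the databases are hypothetical). Two semantically different queries agree on a static test database, so test-based evaluation would accept the generated query, while a counterexample database of the kind an equivalence verifier searches for exposes the difference.

```python
import sqlite3

# Hypothetical ground-truth and generated queries: they differ semantically
# but happen to return the same rows on the static test database below.
GROUND_TRUTH = "SELECT name FROM employees WHERE salary > 50000"
GENERATED    = "SELECT name FROM employees WHERE dept = 'Engineering'"

def run(db_rows, query):
    """Execute a query on an in-memory SQLite database populated with db_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INT)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", db_rows)
    result = sorted(conn.execute(query).fetchall())
    conn.close()
    return result

# Static test database: every Engineering employee also earns > 50000,
# so comparing execution results reports the two queries as "equivalent".
static_db = [("Ada", "Engineering", 90000), ("Bob", "Sales", 40000)]
assert run(static_db, GROUND_TRUTH) == run(static_db, GENERATED)

# Counterexample database: a low-paid Engineering employee makes the
# two queries diverge, revealing that the generated query is wrong.
counterexample_db = static_db + [("Eve", "Engineering", 30000)]
assert run(counterexample_db, GROUND_TRUTH) != run(counterexample_db, GENERATED)
```

A verification-based pipeline like SpotIt automates the search for such a differentiating database within a bound, rather than relying on a fixed test instance to happen to contain one.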