Text-to-SQL technology translates natural language into SQL queries and has become crucial across industries, enabling non-technical users to perform complex data operations. As these systems have grown more sophisticated, the need for accurate evaluation methods has grown with them. However, we find that Execution Accuracy (EX), the most promising evaluation metric, still yields a substantial proportion of false positives and false negatives when compared with human evaluation. This paper therefore introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems that uses large language models (LLMs) to emulate human expert-level evaluation of SQL queries. Our method shows significantly higher agreement with human expert judgments, improving Cohen's kappa from 61 to 78.17. Re-evaluating top-performing models on the Spider and BIRD benchmarks with FLEX reveals substantial shifts in performance rankings, with an average performance decrease of 3.15 from correcting false positives and an increase of 6.07 from addressing false negatives. This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
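To make the false positives and false negatives of EX concrete, the following is a minimal sketch of an execution-based comparison, assuming EX simply checks whether the predicted and gold queries return the same rows on a given database. The `singer` table, the question, the queries, and the helper `execution_match` are all hypothetical illustrations, not taken from Spider, BIRD, or the FLEX implementation.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Hypothetical EX-style check: True when both queries return the same rows."""
    pred_rows = conn.execute(predicted_sql).fetchall()
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as order-insensitive multisets of rows.
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


# Toy in-memory database (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (name TEXT, age INTEGER);
    INSERT INTO singer VALUES ('Ann', 30), ('Bob', 45);
    """
)

# Hypothetical question: "List the singers older than 40."
gold = "SELECT name FROM singer WHERE age > 40"

# Wrong threshold, but the result coincides on this particular data,
# so an execution-based check accepts it: a false positive.
wrong_but_matching = "SELECT name FROM singer WHERE age > 44"
false_positive = execution_match(conn, wrong_but_matching, gold)

# Correct filter with an extra column a human reviewer might accept,
# but the row tuples differ, so the check rejects it: a false negative.
right_but_different = "SELECT name, age FROM singer WHERE age > 40"
false_negative_case = execution_match(conn, right_but_different, gold)

print(false_positive, false_negative_case)
```

In this sketch the first prediction passes only because the toy data happens not to distinguish the two conditions, while the second fails on output shape alone. Whether a human expert would accept the second query depends on the question; that kind of judgment is what an LLM-based evaluator is meant to emulate.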