Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of data-driven applications. To compare these techniques and select the best one for deployment, the community relies on public benchmarks and their leaderboards. Because these benchmarks depend heavily on human annotations for question construction and answer evaluation, the validity of those annotations is crucial. In this paper, we conduct an empirical study that (i) estimates annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. Performance changes range from $-7\%$ to $+31\%$ (in relative terms), and rank changes range from $-9$ to $+9$ positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman's $r_s = 0.85$, $p = 3.26 \times 10^{-5}$), whereas they correlate only weakly with those on the corrected subset (Spearman's $r_s = 0.32$, $p = 0.23$). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions and deployment choices. Our code and data are available at https://github.com/uiuc-kang-lab/text_to_sql_benchmarks.
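The rank-agreement statistic used above can be illustrated with a minimal sketch of Spearman's rank correlation for two leaderboards. The agent ranks below are hypothetical, not the paper's data, and the no-ties closed form is assumed for simplicity:

```python
# Illustrative sketch: quantifying agreement between two leaderboard
# rankings with Spearman's rank correlation (no-ties closed form).
# The ranks below are hypothetical examples, not the paper's data.

def spearman_rs(rank_a, rank_b):
    """Spearman's r_s for two rankings without ties:
    r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the rank difference for item i."""
    assert len(rank_a) == len(rank_b), "rankings must cover the same agents"
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks of five agents on two benchmark variants.
ranks_original = [1, 2, 3, 4, 5]
ranks_corrected = [2, 1, 4, 3, 5]
print(round(spearman_rs(ranks_original, ranks_corrected), 2))  # → 0.8
```

With real leaderboards one would typically use a library routine such as `scipy.stats.spearmanr`, which also handles ties and reports a $p$-value.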