Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of database-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial. In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. Performance changes range from $-7\%$ to $+31\%$ (in relative terms), and rank changes range from $-9$ to $+9$ positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman's $r_s$=0.85, $p$=3.26e-5), whereas they correlate weakly with those on the corrected subset (Spearman's $r_s$=0.32, $p$=0.23). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions or deployment choices. Our code and data are available at https://github.com/uiuc-kang-lab/text_to_sql_benchmarks.
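The rank-agreement statistic quoted above, Spearman's $r_s$, can be sketched with a minimal, stdlib-only Python function. This is an illustrative sketch using the tie-free formula $r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$; the example rankings below are hypothetical, not the paper's leaderboard data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings:
    r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between an item's two ranks."""
    assert len(rank_a) == len(rank_b), "rankings must cover the same items"
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical agent ranks on an original vs. a corrected evaluation set:
orig_ranks = [1, 2, 3, 4, 5, 6, 7, 8]
corrected_ranks = [2, 1, 3, 5, 4, 8, 6, 7]
print(round(spearman_rho(orig_ranks, corrected_ranks), 3))  # prints 0.881
```

Values near $+1$ indicate the two leaderboards order agents almost identically, while values near $0$ (as the paper reports for the corrected subset) indicate the orderings are largely unrelated.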