Large language models (LLMs) achieve strong performance on natural-language-to-SQL (NL2SQL) benchmarks, yet their reported accuracy may be inflated by contamination from benchmark queries or structurally similar patterns seen during training. We introduce SPENCE (Syntactic Probing and Evaluation of NL2SQL Contamination Effects), a controlled syntactic probing framework for detecting and quantifying such contamination. SPENCE systematically generates syntactic variants of test queries for four widely used NL2SQL datasets: Spider, SParC, CoSQL, and the newer BIRD benchmark. We use SPENCE to evaluate multiple high-capacity LLMs under execution-based scoring. For each model, we measure how execution accuracy changes across increasing levels of syntactic divergence and quantify rank sensitivity using Kendall's tau with bootstrap confidence intervals. Aligning these robustness trends with benchmark release dates reveals a clear temporal gradient: older benchmarks such as Spider exhibit the strongest negative tau values, and thus the highest likelihood of training leakage, whereas the more recent BIRD dataset shows minimal sensitivity and appears largely uncontaminated. Together, these findings highlight the importance of temporally contextualized, syntactic-probing evaluation for trustworthy NL2SQL benchmarking.
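The abstract does not spell out how the tau statistic is computed, so the following is only a minimal sketch of one plausible reading: Kendall's tau-b between the syntactic divergence level and execution accuracy at that level, with a percentile bootstrap over test queries. The function names, the 0/1 outcome matrix layout, and the choice to report 0.0 when accuracy is constant across levels are all assumptions made for illustration, not details taken from SPENCE.

```python
import random
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (tie-corrected)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    # Assumption: report 0.0 (no trend) when tau is undefined.
    return (concordant - discordant) / denom if denom else 0.0


def accuracy_by_level(outcomes):
    """outcomes[i][k] is 1 if query i executed correctly at divergence level k."""
    n = len(outcomes)
    return [sum(row[k] for row in outcomes) / n for k in range(len(outcomes[0]))]


def tau_with_bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate of tau plus a percentile bootstrap interval over queries."""
    rng = random.Random(seed)
    levels = list(range(len(outcomes[0])))
    point = kendall_tau_b(levels, accuracy_by_level(outcomes))
    taus = []
    for _ in range(n_boot):
        # Resample queries with replacement, keeping each query's row intact.
        sample = [rng.choice(outcomes) for _ in outcomes]
        taus.append(kendall_tau_b(levels, accuracy_by_level(sample)))
    taus.sort()
    lower = taus[int((alpha / 2) * n_boot)]
    upper = taus[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, (lower, upper)


# Hypothetical toy data: 10 queries, 4 divergence levels, accuracy falling
# from 1.0 to 0.1 as the variants drift further from the original query.
toy_outcomes = [
    [1, 1 if i < 7 else 0, 1 if i < 4 else 0, 1 if i < 1 else 0]
    for i in range(10)
]
toy_point, toy_interval = tau_with_bootstrap_ci(toy_outcomes)
print(toy_point, toy_interval)
```

Under this reading, a tau close to -1 means accuracy drops steadily as variants diverge from the original phrasing, which is the pattern the abstract associates with likely contamination, while a tau near 0 indicates little sensitivity.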