NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

Natural Language to SQL (NL2SQL) technology lets non-expert users query relational databases without writing SQL. Large language models (LLMs) have greatly improved NL2SQL algorithms, but their rapid development has outpaced systematic evaluation, leaving a critical gap in understanding their effectiveness, efficiency, and limitations. To close this gap, we present NL2SQLBench, the first modular evaluation and benchmarking framework for LLM-enabled NL2SQL approaches. Specifically, we decompose NL2SQL systems into three core modules: Schema Selection, Candidate Generation, and Query Revision. For each module, we review existing strategies and propose novel fine-grained metrics that quantify module-level effectiveness and efficiency. We implement these metrics in a flexible multi-agent framework that allows configurable benchmarking across diverse NL2SQL approaches. Using NL2SQLBench, we evaluate ten representative open-source methods on two datasets, the BIRD development set and the ScienceBenchmark development set, with two LLMs, DeepSeek-V3 and GPT-4o mini. We assess each approach on the three core modules and along multiple performance dimensions. Our evaluation reveals significant gaps in existing NL2SQL methods: there is substantial room for accuracy improvement, and significant computational inefficiency severely hampers real-world adoption. Our analysis also identifies critical shortcomings in current benchmark datasets and evaluation rules, including inaccurate gold SQL annotations. By synthesizing these insights into a unified benchmark, our study establishes a clear reference point for fair comparison and offers guidance for future targeted innovation in NL2SQL technology.
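
To make the three-module decomposition concrete, the following is a minimal sketch of how a pipeline split into Schema Selection, Candidate Generation, and Query Revision could be instrumented module by module. It is an illustration under stated assumptions and not the NL2SQLBench implementation: the names `ModuleTrace`, `run_pipeline`, and `schema_recall`, the toy stage functions, and the choice of wall-clock time and column recall as example measurements are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ModuleTrace:
    """Output and wall-clock time of one module (hypothetical record type)."""
    name: str
    output: object
    seconds: float


@dataclass
class PipelineResult:
    sql: str
    traces: list = field(default_factory=list)


def timed(name, fn, *args):
    """Run one module and record how long it took."""
    start = time.perf_counter()
    output = fn(*args)
    return ModuleTrace(name, output, time.perf_counter() - start)


def run_pipeline(question, schema, select_schema, generate_candidates, revise_query):
    """Chain the three modules, keeping a trace per module."""
    t1 = timed("schema_selection", select_schema, question, schema)
    t2 = timed("candidate_generation", generate_candidates, question, t1.output)
    t3 = timed("query_revision", revise_query, question, t1.output, t2.output)
    return PipelineResult(sql=t3.output, traces=[t1, t2, t3])


def schema_recall(selected, gold):
    """Fraction of gold columns kept by schema selection (assumed example metric)."""
    return len(selected & gold) / len(gold) if gold else 1.0


# Toy stand-ins for the three modules; a real system would call an LLM here.
def toy_select(question, schema):
    words = set(question.lower().replace("?", "").split())
    return {col for col in schema if col.split(".")[1] in words}


def toy_generate(question, selected):
    cols = sorted(selected)
    table = cols[0].split(".")[0]
    return ["SELECT " + ", ".join(cols) + " FROM " + table]


def toy_revise(question, selected, candidates):
    # Pick the first candidate and normalize the trailing semicolon.
    return candidates[0].strip().rstrip(";") + ";"


schema = {"users.name", "users.age", "orders.total"}
result = run_pipeline(
    "What is the name and age of each user?",
    schema, toy_select, toy_generate, toy_revise,
)
```

Because each module is an ordinary callable in this sketch, one stage can be swapped while the others stay fixed, which is the kind of configurable, module-level comparison the abstract describes.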