Text-to-SQL enables users to interact with databases using natural language, simplifying the retrieval and synthesis of information. Despite the remarkable success of large language models (LLMs) in translating natural language questions into SQL queries, widespread deployment remains limited due to two primary challenges. First, the effective use of text-to-SQL models depends on users' understanding of the model's capabilities, that is, the scope of questions the model can correctly answer. Second, the absence of abstention mechanisms can allow incorrect SQL generation to go unnoticed, thereby undermining trust in the model's output. To enable wider deployment, it is crucial to address these challenges in model design and to enhance model evaluation so as to build trust in the model's output. To this end, we introduce TrustSQL, a novel comprehensive benchmark designed to evaluate text-to-SQL reliability, defined as a model's ability to correctly handle any type of input question by generating correct SQL queries for feasible questions and abstaining from answering infeasible ones (e.g., due to schema incompatibility or functionalities beyond SQL). We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches: (1) pipeline-based methods that combine SQL generators with infeasible-question detectors and SQL error detectors for abstention; and (2) unified methods that use a single model for the entire task. Our experimental results reveal that achieving high scores under severe penalties requires significant effort, offering a new perspective on developing text-to-SQL models for safer deployment. TrustSQL is available at https://github.com/glee4810/TrustSQL.