The task of Text-to-SQL enables anyone to retrieve information from SQL databases using natural language. Despite several challenges, recent models have made remarkable progress on this task by leveraging large language models (LLMs). Interestingly, we find that LLM-based models without fine-tuning behave quite differently from their fine-tuned counterparts, so that current evaluation metrics fail to convey their performance accurately. We therefore analyze the two primary metrics, Test Suite Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM), examine their robustness for this task, and address their shortcomings. We compare the performance of 9 LLM-based models using EXE, the original ESM, and our improved ESM (called ESM+). Our results show that EXE and ESM exhibit high false positive and false negative rates of 11.3% and 13.9%, whereas ESM+ reduces these to 0.1% and 2.6% respectively, providing a significantly more stable evaluation. We release the ESM+ script as open source so the community can contribute while enjoying a more reliable assessment of Text-to-SQL.
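To make the contrast between the two metric families concrete, the following minimal sketch (not the paper's ESM+ script; the table and queries are hypothetical) shows how an execution-based check (EXE-style) can accept a semantically equivalent prediction that a naive exact-matching check (ESM-style) rejects as a false negative:

```python
import sqlite3

# Hypothetical toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Ann", 34), ("Bob", 25), ("Eve", 41)])

gold = "SELECT name FROM users WHERE age > 30"
pred = "SELECT name FROM users WHERE NOT age <= 30"  # semantically equivalent

def execution_match(q1, q2):
    """EXE-style check: compare result sets, ignoring row order."""
    r1 = sorted(conn.execute(q1).fetchall())
    r2 = sorted(conn.execute(q2).fetchall())
    return r1 == r2

def exact_match(q1, q2):
    """Naive ESM-style check: compare normalized query tokens."""
    return q1.lower().split() == q2.lower().split()

print(execution_match(gold, pred))  # True  - identical results
print(exact_match(gold, pred))      # False - a false negative under matching
```

Conversely, EXE can produce false positives when two non-equivalent queries happen to return the same rows on a particular database instance, which is why the abstract reports error rates for both metrics.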