The task of Text-to-SQL enables anyone to retrieve information from SQL databases using natural language. Despite several challenges, recent models have made remarkable progress on this task by leveraging large language models (LLMs). Interestingly, we find that LLM-based models without fine-tuning behave quite differently from their fine-tuned counterparts, so that current evaluation metrics fail to convey their performance accurately. We therefore analyze the two primary metrics, Test Suite Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM), examine their robustness for this task, and address their shortcomings. We compare the performance of 9 LLM-based models using EXE, the original ESM, and our improved ESM (called ESM+). Our results show that EXE and ESM exhibit high false positive and false negative rates of 11.3% and 13.9%, whereas ESM+ reduces these to 0.1% and 2.6% respectively, providing a significantly more stable evaluation. We release the ESM+ script as open source so the community can contribute while enjoying a more reliable assessment of Text-to-SQL.
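To make the contrast between the two metric families concrete, the following minimal sketch (not the paper's ESM+ script; the table and queries are hypothetical) shows how an execution-based check (EXE-style) can accept a semantically equivalent prediction that a naive exact-matching check (ESM-style) rejects as a false negative:

```python
import sqlite3

# Hypothetical toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Ann", 34), ("Bob", 25), ("Eve", 41)])

gold = "SELECT name FROM users WHERE age > 30"
pred = "SELECT name FROM users WHERE NOT age <= 30"  # semantically equivalent

def execution_match(q1, q2):
    """EXE-style check: compare result sets, ignoring row order."""
    r1 = sorted(conn.execute(q1).fetchall())
    r2 = sorted(conn.execute(q2).fetchall())
    return r1 == r2

def exact_match(q1, q2):
    """Naive ESM-style check: compare normalized query tokens."""
    return q1.lower().split() == q2.lower().split()

print(execution_match(gold, pred))  # True  - identical results
print(exact_match(gold, pred))      # False - a false negative under matching
```

Conversely, EXE can produce false positives when two non-equivalent queries happen to return the same rows on a particular database instance, which is why the abstract reports error rates for both metrics.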