Text-to-SQL, the task of translating natural language questions into SQL queries, is part of various business processes. Automating it is an emerging challenge that would let software practitioners interact seamlessly with relational databases using natural language, thereby bridging the gap between business needs and software capabilities. In this paper, we consider Large Language Models (LLMs), which have achieved state-of-the-art results on various NLP tasks. Specifically, we benchmark Text-to-SQL performance, evaluation methodologies, and input optimization (e.g., prompting). In light of our empirical observations, we propose two novel metrics designed to adequately measure the similarity between SQL queries. Overall, we share various findings with the community, notably on how to select the right LLM for Text-to-SQL tasks. We further demonstrate that a tree-based edit distance is a reliable metric for assessing the similarity between generated SQL queries and the oracle when benchmarking Text-to-SQL approaches. This metric is important because it relieves researchers of the need to perform computationally expensive experiments, such as executing the generated queries, as done in prior work. Our work implements financial-domain use cases and therefore contributes to the advancement of Text-to-SQL systems and their practical adoption in this domain.
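To make the idea of a tree-based edit distance between SQL queries concrete, the following is a minimal sketch, not the metric defined in this work. It assumes each query has already been parsed into an ordered, labeled tree; the `node` helper, the hand-built trees, and the unit cost for every insert, delete, and relabel operation are illustrative assumptions. The distance is computed with the standard recursive forest formulation for ordered trees, memoized for small inputs.

```python
from functools import lru_cache


def node(label, *children):
    """Hypothetical helper: a tree is a (label, children) pair of tuples."""
    return (label, tuple(children))


def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Ordered forest edit distance with unit costs (an assumption)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert every remaining node
    if not f2:
        return size(f1)  # delete every remaining node
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    # Delete the root of the rightmost tree in f1; its children move up.
    delete = forest_distance(f1[:-1] + c1, f2) + 1
    # Insert the root of the rightmost tree in f2.
    insert = forest_distance(f1, f2[:-1] + c2) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    match = (
        forest_distance(c1, c2)
        + forest_distance(f1[:-1], f2[:-1])
        + (l1 != l2)
    )
    return min(delete, insert, match)


def tree_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# Hand-built trees standing in for parsed queries (illustrative only).
# SELECT name FROM customers WHERE age > 30
generated = node(
    "SELECT",
    node("COLUMNS", node("name")),
    node("FROM", node("customers")),
    node("WHERE", node(">", node("age"), node("30"))),
)
# SELECT name FROM customers WHERE age > 40
oracle = node(
    "SELECT",
    node("COLUMNS", node("name")),
    node("FROM", node("customers")),
    node("WHERE", node(">", node("age"), node("40"))),
)
# SELECT name FROM customers
no_filter = node(
    "SELECT",
    node("COLUMNS", node("name")),
    node("FROM", node("customers")),
)

print(tree_distance(generated, oracle))     # one relabel: 30 -> 40
print(tree_distance(generated, no_filter))  # delete the WHERE subtree
```

A comparison of this kind needs only the two query strings and a parser, which is why it avoids setting up a database and executing the generated queries. In practice, the tree shape produced by the chosen SQL parser and the cost assigned to each operation would both affect the resulting scores.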