Generating accurate SQL from natural language questions (text-to-SQL) is a long-standing challenge due to the complexities of understanding user questions, comprehending database schemas, and generating SQL. Conventional text-to-SQL systems, built on human engineering and deep neural networks, have made substantial progress. Subsequently, pre-trained language models (PLMs) were developed and applied to text-to-SQL tasks, achieving promising performance. As modern databases become more complex, the corresponding user questions also grow more challenging, leading PLMs with limited comprehension capabilities to produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods for PLMs, which in turn restricts the applications of PLM-based systems. Most recently, large language models (LLMs) have demonstrated significant capabilities in natural language understanding as model scale continues to increase. Therefore, integrating LLM-based implementations can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we present a comprehensive review of LLM-based text-to-SQL. Specifically, we provide a brief overview of the technical challenges and the evolutionary process of text-to-SQL. Then, we give a detailed introduction to the datasets and metrics designed to evaluate text-to-SQL systems. After that, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we discuss the remaining challenges in this field and outline expectations for future research directions.