Generating accurate SQL from natural language questions (text-to-SQL) is a long-standing challenge because of the complexities of understanding user questions, comprehending database schemas, and generating SQL. Conventional text-to-SQL systems, which combine human engineering with deep neural networks, have made substantial progress. Subsequently, pre-trained language models (PLMs) were developed and applied to text-to-SQL tasks, achieving promising performance. As modern databases become more complex, the corresponding user questions also grow more challenging, causing PLMs with limited parameter counts to produce incorrect SQL. Addressing this requires more sophisticated and tailored optimization methods, which in turn restricts the applications of PLM-based systems. Recently, large language models (LLMs) have demonstrated significant natural language understanding capabilities as model scale increases. Integrating LLM-based implementations can therefore bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we present a comprehensive review of LLM-based text-to-SQL. Specifically, we give a brief overview of the technical challenges and the evolution of text-to-SQL. We then provide a detailed introduction to the datasets and metrics designed to evaluate text-to-SQL systems. After that, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we discuss the remaining challenges in this field and outline expectations for future research directions.