Large Language Model-based (LLM-based) Text-to-SQL methods have made important progress in generating SQL queries for real-world applications. However, when confronted with table content-aware questions in real-world scenarios, ambiguous data content keywords and non-existent database schema column names within the question lead to poor performance of existing methods. To solve this problem, we propose a novel approach towards Table Content-aware Text-to-SQL with Self-Retrieval (TCSR-SQL). It leverages the LLM's in-context learning capability to extract data content keywords within the question and infer the possibly related database schema, which are used to generate a Seed SQL that performs a fuzzy search over the database. The search results are then used to confirm the encoding knowledge with the designed encoding knowledge table, including the column names and exact stored content values used in the SQL. The encoding knowledge is then used to obtain the final Precise SQL through a multi-round generation-execution-revision process. To validate our approach, we introduce a table-content-aware, question-related benchmark dataset containing 1,692 question-SQL pairs. Comprehensive experiments conducted on this benchmark demonstrate the remarkable performance of TCSR-SQL, which achieves an improvement of at least 13.7% in execution accuracy compared to other state-of-the-art methods.
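To make the self-retrieval idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes an in-memory SQLite database and stubs out every LLM step: the keyword ("equity") and the candidate columns are hard-coded where TCSR-SQL would obtain them via in-context learning. The table `funds`, its columns, and the helper names `fuzzy_search`, `build_encoding_knowledge`, and `demo` are all hypothetical.

```python
import sqlite3


def fuzzy_search(conn, table, column, keyword):
    """Seed-SQL-style probe: distinct stored values in `column` containing `keyword`."""
    # Identifiers are interpolated only because they come from a trusted,
    # hard-coded list in this sketch; the keyword itself is bound as a parameter.
    sql = f'SELECT DISTINCT "{column}" FROM "{table}" WHERE "{column}" LIKE ?'
    return [row[0] for row in conn.execute(sql, (f"%{keyword}%",))]


def build_encoding_knowledge(conn, table, candidate_columns, keywords):
    """Map each keyword to the (column, exact stored value) pairs where it was found."""
    knowledge = {}
    for kw in keywords:
        hits = []
        for col in candidate_columns:
            for value in fuzzy_search(conn, table, col, kw):
                hits.append((col, value))
        knowledge[kw] = hits
    return knowledge


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical toy data: the stored value differs from the question wording.
    conn.execute("CREATE TABLE funds (fund_name TEXT, fund_type TEXT, manager TEXT)")
    conn.executemany(
        "INSERT INTO funds VALUES (?, ?, ?)",
        [
            ("Alpha Growth", "Equity Fund (Type A)", "Li Wei"),
            ("Beta Income", "Bond Fund (Type B)", "Chen Yu"),
            ("Gamma Value", "Equity Fund (Type A)", "Wang Fang"),
        ],
    )
    # Question: "How many equity funds are there?"
    # Assumed (stubbed) LLM output: keyword "equity", candidate columns below.
    knowledge = build_encoding_knowledge(
        conn, "funds", ["fund_name", "fund_type"], ["equity"]
    )
    # Use the confirmed column name and exact stored value in the final query.
    column, value = knowledge["equity"][0]
    precise_sql = f'SELECT COUNT(*) FROM funds WHERE "{column}" = ?'
    count = conn.execute(precise_sql, (value,)).fetchone()[0]
    return knowledge, count


if __name__ == "__main__":
    print(demo())
```

The sketch covers only the retrieval and value-confirmation steps. The multi-round generation-execution-revision process described in the abstract would wrap the final query in a loop that executes it, inspects errors or empty results, and asks the LLM to revise.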