Text-to-SQL is the task of converting a natural language question into a Structured Query Language (SQL) query that retrieves information from a database. Large language models (LLMs) perform well on natural language generation tasks, but they are not specifically pre-trained to understand the syntax and semantics of SQL. In this paper, we propose an LLM-based Text-to-SQL framework that retrieves helpful demonstration examples to prompt the LLM. Questions posed over different database schemas can vary widely in wording, even when the intentions behind them are similar and the corresponding SQL queries resemble one another, so identifying demonstrations that actually fit the question at hand is crucial. We design a de-semanticization mechanism that extracts question skeletons, allowing us to retrieve similar examples based on structural similarity. We also model the relationships between question tokens and database schema items (i.e., tables and columns) to filter out schema-related information. Our framework adapts the range of the database schema included in the prompt to balance prompt length against valuable information, and a fallback mechanism provides a more detailed schema if the generated SQL query fails. Our framework outperforms state-of-the-art models and demonstrates strong generalization ability on three cross-domain Text-to-SQL benchmarks.
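To make the idea of skeleton-based retrieval concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes that schema mentions can be found by plain string matching (the abstract describes a learned model of token-to-schema relationships instead), and it swaps in `difflib.SequenceMatcher` as a stand-in similarity measure. The names `question_skeleton`, `retrieve`, `MASK`, and the toy demonstration pool are all hypothetical.

```python
import re
from difflib import SequenceMatcher

MASK = "[MASK]"  # hypothetical placeholder token


def question_skeleton(question, schema_items):
    """Strip schema- and value-specific words, keeping the question's structure."""
    text = question.lower()
    # Mask quoted literals and numbers (database values).
    text = re.sub(r"'[^']*'|\"[^\"]*\"", MASK, text)
    text = re.sub(r"\b\d+(\.\d+)?\b", MASK, text)
    # Mask schema mentions, longest first; a trailing "s" covers simple plurals.
    for item in sorted({s.lower() for s in schema_items}, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(item) + r"s?\b", MASK, text)
    tokens = re.findall(r"\[MASK\]|[a-z]+", text)
    # Collapse runs of adjacent placeholders into one.
    skeleton = []
    for tok in tokens:
        if tok == MASK and skeleton and skeleton[-1] == MASK:
            continue
        skeleton.append(tok)
    return skeleton


def retrieve(query_skeleton, pool, k=1):
    """Return the k demonstrations whose skeletons are most similar to the query's."""
    def score(demo):
        return SequenceMatcher(None, query_skeleton, demo["skeleton"]).ratio()
    return sorted(pool, key=score, reverse=True)[:k]


# Toy demonstration pool drawn from two unrelated databases.
pool = [
    {"question": "List the name of each city.",
     "schema": ["city", "name"],
     "sql": "SELECT name FROM city"},
    {"question": "How many students are younger than 18?",
     "schema": ["student", "age"],
     "sql": "SELECT count(*) FROM student WHERE age < 18"},
]
for demo in pool:
    demo["skeleton"] = question_skeleton(demo["question"], demo["schema"])

query = question_skeleton("How many singers are older than 30?", ["singer", "age"])
top = retrieve(query, pool, k=1)
print(" ".join(query))
print(top[0]["sql"])
```

Although the query is about singers and the best-matching demonstration is about students, their skeletons differ in a single token, so the structurally similar SQL example is retrieved despite the different schemas.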