Utilizing large language models (LLMs) for transforming natural language questions into SQL queries (text-to-SQL) is a promising yet challenging approach, particularly when applied to real-world databases with complex and extensive schemas. In particular, effectively incorporating data catalogs and database values for SQL generation remains an obstacle, leading to suboptimal solutions. We address this problem by proposing a new pipeline that effectively retrieves relevant data and context, selects an efficient schema, and synthesizes correct and efficient SQL queries. To increase retrieval precision, our pipeline introduces a hierarchical retrieval method leveraging model-generated keywords, locality-sensitive hashing indexing, and vector databases. Additionally, we have developed an adaptive schema pruning technique that adjusts based on the complexity of the problem and the model's context size. Our approach generalizes to both frontier proprietary models like GPT-4 and open-source models such as Llama-3-70B. Through a series of ablation studies, we demonstrate the effectiveness of each component of our pipeline and its impact on the end-to-end performance. Our method achieves new state-of-the-art performance on the cross-domain challenging BIRD dataset.
翻译:利用大型语言模型(LLM)将自然语言问题转换为SQL查询(文本到SQL)是一种前景广阔但具有挑战性的方法,尤其是在应用于具有复杂且庞大数据模式的实际数据库时。具体而言,如何有效整合数据目录和数据库值以生成SQL仍然是一个障碍,这导致现有解决方案效果欠佳。我们通过提出一种新流程来解决该问题,该流程能够有效检索相关数据和上下文、选择高效的数据模式,并合成正确且高效的SQL查询。为提高检索精度,我们的流程引入了一种分层检索方法,该方法利用模型生成的关键词、局部敏感哈希索引和向量数据库。此外,我们还开发了一种自适应模式剪枝技术,该技术可根据问题复杂度和模型的上下文长度进行动态调整。我们的方法可泛化至前沿的专有模型(如GPT-4)和开源模型(如Llama-3-70B)。通过一系列消融实验,我们证明了流程中每个组件的有效性及其对端到端性能的影响。我们的方法在跨领域挑战性数据集BIRD上取得了最新的最优性能。