Translating natural language queries into SQL queries (NL2SQL or Text-to-SQL) has recently been empowered by large language models (LLMs). Applying LLM-based NL2SQL methods to a large collection of SQL databases requires processing large quantities of meta-information about those databases, which results in lengthy prompts with many tokens and high processing costs. To address this challenge, we introduce Datalake Agent, an agentic system designed to enable an LLM to solve NL2SQL tasks more efficiently. Rather than relying on direct NL2SQL solvers that call the LLM once with all meta-information in the prompt, the Datalake Agent uses an interactive loop to reduce the amount of meta-information used. Within the loop, the LLM operates in a reasoning framework that selectively requests only the information needed to solve a table question answering task. We evaluate the Datalake Agent on a collection of 23 databases with 100 table question answering tasks. The Datalake Agent reduces the tokens used by the LLM by up to 87% and thus allows for substantial cost reductions while maintaining competitive performance.
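The interactive loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy `CATALOG`, the command names (`LIST_DATABASES`, `LIST_TABLES`, `GET_SCHEMA`, `ANSWER`), and the `scripted_llm` function, which stands in for a real LLM call by replaying a fixed plan. The sketch contrasts a direct-solver prompt that contains every schema with a loop in which the model asks only for the metadata it needs.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical catalog: database -> table -> column names (not from the paper).
CATALOG: Dict[str, Dict[str, List[str]]] = {
    "sales": {"orders": ["id", "customer_id", "total"], "customers": ["id", "name"]},
    "hr": {"employees": ["id", "name", "salary"], "departments": ["id", "title"]},
    "logistics": {"shipments": ["id", "order_id", "status"]},
}


def full_metadata_prompt(question: str) -> str:
    """Direct-solver style prompt: every schema is included at once."""
    lines = [f"Question: {question}"]
    for db, tables in CATALOG.items():
        for table, cols in tables.items():
            lines.append(f"{db}.{table}({', '.join(cols)})")
    return "\n".join(lines)


def handle_request(command: str) -> str:
    """Answer one metadata request issued by the model (assumed command set)."""
    parts = command.split()
    if parts[0] == "LIST_DATABASES":
        return ", ".join(CATALOG)
    if parts[0] == "LIST_TABLES":
        return ", ".join(CATALOG[parts[1]])
    if parts[0] == "GET_SCHEMA":
        db, table = parts[1].split(".")
        return f"{db}.{table}({', '.join(CATALOG[db][table])})"
    return f"unknown command: {command}"


def run_agent(
    question: str, llm: Callable[[List[str]], str], max_steps: int = 8
) -> Tuple[Optional[str], List[str]]:
    """Loop until the model emits 'ANSWER <sql>' or the step budget runs out."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        command = llm(transcript)
        if command.startswith("ANSWER "):
            return command[len("ANSWER "):], transcript
        transcript.append(f"> {command}")
        transcript.append(handle_request(command))
    return None, transcript


def scripted_llm(transcript: List[str]) -> str:
    """Stand-in for a real LLM: replays a fixed plan, one step per call."""
    plan = [
        "LIST_DATABASES",
        "LIST_TABLES hr",
        "GET_SCHEMA hr.employees",
        "ANSWER SELECT AVG(salary) FROM employees",
    ]
    requests_so_far = sum(1 for line in transcript if line.startswith("> "))
    return plan[requests_so_far]


question = "What is the average salary?"
sql, transcript = run_agent(question, scripted_llm)
direct_prompt = full_metadata_prompt(question)
```

In this sketch the transcript ends up containing only the `hr.employees` schema, whereas `direct_prompt` carries all five table schemas. With a catalog this small the loop's extra turns outweigh any saving; a reduction of the kind reported above would presumably only show up when the catalog is much larger than the metadata actually requested.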