Large language models (LLMs) have become the dominant paradigm for the challenging task of text-to-SQL. LLM-empowered text-to-SQL methods are typically categorized into prompting-based and tuning-based approaches. Compared to prompting-based methods, benchmarking fine-tuned LLMs for text-to-SQL is important yet under-explored, partly owing to the prohibitively high computational cost. In this paper, we present DB-GPT-Hub, an open benchmark suite for LLM-empowered text-to-SQL that primarily focuses on tuning LLMs at large scales. The proposed benchmark consists of: 1. a standardized and comprehensive evaluation of text-to-SQL tasks obtained by fine-tuning medium- to large-sized open LLMs; 2. a modularized and easy-to-extend codebase supporting mainstream LLMs and experimental scenarios, which prioritizes fine-tuning methods but can be easily extended to prompting-based settings. Our work investigates the potential gains and the performance boundaries of tuning approaches compared to prompting approaches, and explores optimal solutions tailored to specific scenarios. We hope DB-GPT-Hub, along with these findings, enables further research and broad applications that would otherwise be difficult owing to the absence of a dedicated open benchmark. The project code has been released at https://github.com/eosphoros-ai/DB-GPT-Hub.