Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as FIBEN, Spider, and BIRD. In earlier work, we showed that LLMs are far less effective at querying large private enterprise data warehouses, and we released BEAVER, the first private enterprise text-to-SQL benchmark. To create BEAVER, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on the additional work of constructing and validating corresponding natural language utterances is not only challenging but also costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions improves annotation accuracy, benchmark reliability, and the robustness of model evaluation. By streamlining the creation of custom benchmarks, BenchPress gives researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.
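To make the described pipeline concrete, below is a minimal Python sketch of the draft-generation step: retrieve a few similar previously annotated SQL/question pairs, then prompt an LLM for several candidate natural-language descriptions of a new query. This is an illustration under stated assumptions, not BenchPress's actual implementation: the exemplar store, the `retrieve` and `propose_descriptions` helpers, the similarity measure, and the model name are all hypothetical, and it assumes an OpenAI-compatible chat client.

```python
"""Illustrative sketch (not the BenchPress implementation) of RAG-assisted
draft generation: few-shot exemplars are retrieved by SQL similarity and
an LLM proposes multiple natural-language descriptions for human review."""

from difflib import SequenceMatcher

from openai import OpenAI  # assumption: an OpenAI-compatible client is available

# Hypothetical in-memory store of previously annotated (SQL, question) pairs.
EXEMPLARS = [
    ("SELECT name FROM employees WHERE dept = 'sales'",
     "Which employees work in the sales department?"),
    ("SELECT COUNT(*) FROM orders WHERE status = 'open'",
     "How many orders are currently open?"),
]


def retrieve(sql: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k exemplars whose SQL text is most similar to the input.

    A toy stand-in for real retrieval (e.g., embedding search)."""
    return sorted(
        EXEMPLARS,
        key=lambda ex: SequenceMatcher(None, sql.lower(), ex[0].lower()).ratio(),
        reverse=True,
    )[:k]


def propose_descriptions(sql: str, n_drafts: int = 3) -> list[str]:
    """Ask the LLM for several candidate NL descriptions of one SQL query."""
    shots = "\n\n".join(f"SQL: {s}\nQuestion: {q}" for s, q in retrieve(sql))
    prompt = (
        "Write a natural-language question that is answered by the SQL query.\n\n"
        f"{shots}\n\nSQL: {sql}\nQuestion:"
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
        n=n_drafts,            # multiple drafts for the annotator to choose from
        temperature=0.8,       # encourage diversity across candidates
    )
    return [choice.message.content.strip() for choice in resp.choices]


if __name__ == "__main__":
    for draft in propose_descriptions("SELECT AVG(salary) FROM employees GROUP BY dept"):
        print("-", draft)
```

Generating several diverse drafts rather than a single answer is what makes the human-in-the-loop step cheap: the expert only selects, ranks, or lightly edits a candidate instead of writing an utterance from scratch.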