BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets such as Fiben, Spider, and Bird. In earlier work, we showed that LLMs are much less effective at querying large private enterprise data warehouses, and we released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify the natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on the additional work of constructing and validating the corresponding natural language utterances is both difficult and costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs and found that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions improves annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress gives researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available in our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.
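To make the annotation loop described above concrete, the following is a minimal sketch under stated assumptions, not the BenchPress implementation. All function names (`retrieve`, `build_prompt`, `propose`, `review`) are hypothetical, the token-overlap retriever is a simple stand-in for whatever retrieval BenchPress actually uses, and `llm` is assumed to be any callable that maps a prompt string to a completion string.

```python
import re
from typing import Callable, List, Optional, Tuple

# An already-annotated pair: (SQL query, natural language question).
Example = Tuple[str, str]


def tokens(sql: str) -> set:
    """Lowercased identifiers and keywords in a SQL string (numbers are skipped)."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))


def retrieve(query: str, pool: List[Example], k: int = 2) -> List[Example]:
    """Assumed retriever: rank annotated examples by Jaccard token overlap."""
    q = tokens(query)

    def jaccard(example: Example) -> float:
        t = tokens(example[0])
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(pool, key=jaccard, reverse=True)[:k]


def build_prompt(query: str, shots: List[Example]) -> str:
    """Few-shot prompt: retrieved pairs first, then the query to describe."""
    parts = ["Describe each SQL query as a natural language question."]
    for sql, question in shots:
        parts.append(f"SQL: {sql}\nQuestion: {question}")
    parts.append(f"SQL: {query}\nQuestion:")
    return "\n\n".join(parts)


def propose(query: str, pool: List[Example],
            llm: Callable[[str], str], n: int = 3) -> List[str]:
    """Ask the (assumed) LLM callable n times and keep the distinct drafts."""
    prompt = build_prompt(query, retrieve(query, pool))
    drafts: List[str] = []
    for _ in range(n):
        draft = llm(prompt).strip()
        if draft and draft not in drafts:
            drafts.append(draft)
    return drafts


def review(drafts: List[str], choice: int, edited: Optional[str] = None) -> str:
    """Human step: accept the chosen draft, or override it with an edited version."""
    return edited if edited is not None else drafts[choice]
```

In this sketch the human decision is reduced to an index and an optional edited string; an interactive tool would collect both through a user interface, and ranking could be represented as an ordered list of indices instead of a single choice.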
