BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.
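The workflow described in the abstract (retrieve similar annotated queries, prompt an LLM for several drafts, let an expert pick or edit one) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the BenchPress implementation: all function names are hypothetical, token-overlap (Jaccard) similarity stands in for whatever retriever the system actually uses, and `generate` is a placeholder for any LLM call supplied by the caller.

```python
import re
from typing import Callable, List, Optional, Set, Tuple

Example = Tuple[str, str]  # (SQL query, natural language question)


def sql_tokens(sql: str) -> Set[str]:
    """Lowercased keyword/identifier tokens of a SQL string (numbers are ignored)."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_examples(query: str, annotated: List[Example], k: int = 2) -> List[Example]:
    """Return the k annotated pairs whose SQL overlaps most with the query.

    Assumption: token overlap is a stand-in retriever chosen for illustration.
    """
    q = sql_tokens(query)
    ranked = sorted(annotated, key=lambda ex: jaccard(q, sql_tokens(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Example]) -> str:
    """Assemble a few-shot prompt from the retrieved pairs plus the target query."""
    parts = ["Describe each SQL query as a natural language question."]
    for sql, question in examples:
        parts.append(f"SQL: {sql}\nQuestion: {question}")
    parts.append(f"SQL: {query}\nQuestion:")
    return "\n\n".join(parts)


def propose_candidates(
    query: str,
    annotated: List[Example],
    generate: Callable[[str], str],  # hypothetical LLM wrapper provided by the caller
    n: int = 3,
    k: int = 2,
) -> List[str]:
    """Ask the LLM n times for a draft description of the query."""
    prompt = build_prompt(query, retrieve_examples(query, annotated, k))
    return [generate(prompt) for _ in range(n)]


def review(candidates: List[str], choice: int, edited: Optional[str] = None) -> str:
    """Human step: accept the chosen draft, or the expert's edited text if given."""
    return edited if edited is not None else candidates[choice]
```

In this sketch the accepted annotation would be appended to `annotated`, so later queries could retrieve it as a few-shot example; whether BenchPress feeds accepted annotations back in this way is not stated in the abstract.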
