Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous database systems, with inconsistent references and key information buried in unstructured text. Existing benchmarks tackle only individual pieces of this problem (e.g., translating natural-language questions into SQL queries, or answering questions over small tables provided in context), but do not evaluate the full pipeline of integrating, transforming, and analyzing data across multiple database systems. To fill this gap, we present the Data Agent Benchmark (DAB), grounded in a formative study of enterprise data agent workloads across six industries. DAB comprises 54 queries spanning 12 datasets, 9 domains, and 4 database management systems. On DAB, the best frontier model (Gemini-3-Pro) achieves only 38% pass@1 accuracy. We benchmark five frontier LLMs, analyze their failure modes, and distill takeaways for future data agent development. Our benchmark and experiment code are published at github.com/ucbepic/DataAgentBench.