The growing demand for data-driven decision-making has created an urgent need for data agents that can reason over heterogeneous data (databases, documents, web content, images, videos, and audio) to answer complex analytical queries. However, evaluating such agents remains challenging: existing benchmarks often focus on isolated agent capabilities or a limited set of data modalities, and they lack both comprehensive coverage of heterogeneous data and rigorous evaluation across diverse data agent architectures. To address these challenges, we present FDABench, a benchmark for evaluating data agents' reasoning ability over heterogeneous data in analytical scenarios. Our contributions are threefold: (1) We build a comprehensive benchmark of 2,007 tasks spanning six data modalities, with a unified, multi-granularity evaluation framework. (2) We design PUDDING, an agentic dataset construction framework that combines LLM generation with iterative expert validation for reliable and scalable benchmark construction. (3) We conduct extensive experiments across diverse data agent architectures, including general analytical agents, semantic operator frameworks, and RAG-based methods, and derive key insights and guidelines for future data agent development. Our data and source code are released at https://github.com/fdabench/FDAbench.