A Survey of Data Agents: Emerging Paradigm or Overstated Hype?

Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li, Wei Zhou, Xinyu Liu, Zhangyang Peng, Tianqi Luo, Yu Li, Chengliang Chai, Chong Chen, Shimin Di, Ju Fan, Ji Sun, Nan Tang, Fugee Tsung, Jiannan Wang, Chenglin Wu, Yanwei Xu, Shaolei Zhang, Yong Zhang, Xuanhe Zhou, Guoliang Li, Yuyu Luo

From arXiv. Please refer to our paper list and companion materials at: https://github.com/HKUSTDial/awesome-data-agents

The rapid advancement of large language models (LLMs) has spurred the emergence of data agents: autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" is currently used ambiguously and inconsistently, conflating simple query responders with sophisticated autonomous architectures. This ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth. Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation. Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
