A Survey of Data Agents: Emerging Paradigm or Overstated Hype?

Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li, Wei Zhou, Xinyu Liu, Zhangyang Peng, Tianqi Luo, Yu Li, Chengliang Chai, Chong Chen, Shimin Di, Ju Fan, Ji Sun, Nan Tang, Fugee Tsung, Jiannan Wang, Chenglin Wu, Yanwei Xu, Shaolei Zhang, Yong Zhang, Xuanhe Zhou, Guoliang Li, Yuyu Luo

From arXiv. Please refer to our paper list and companion materials at: https://github.com/HKUSTDial/awesome-data-agents

The rapid advancement of large language models (LLMs) has spurred the emergence of data agents, autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" currently suffers from terminological ambiguity and inconsistent adoption, conflating simple query responders with sophisticated autonomous architectures. This ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth. Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that delineate and trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation. Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning proactive, generative data agents.
