Data agents are an emerging paradigm that leverages large language models (LLMs) and tool-using agents to automate data management, preparation, and analysis tasks. However, the term "data agent" is currently used inconsistently, conflating simple query-responsive assistants with aspirational, fully autonomous "data scientists". This ambiguity blurs capability boundaries and accountability, making it difficult for users, system builders, and regulators to reason about what a "data agent" can and cannot do. In this tutorial, we propose the first hierarchical taxonomy of data agents, from Level 0 (L0, no autonomy) to Level 5 (L5, full autonomy). Building on this taxonomy, we introduce a lifecycle- and level-driven view of data agents. We will (1) present the L0-L5 taxonomy and the key evolutionary leaps that separate simple assistants from truly autonomous data agents, (2) review representative L0-L2 systems across data management, preparation, and analysis, (3) highlight emerging Proto-L3 systems that strive to autonomously orchestrate end-to-end data workflows, under supervision, to tackle diverse and comprehensive data-related tasks, and (4) discuss forward-looking research challenges on the path towards proactive (L4) and generative (L5) data agents. We aim to offer both a practical map of today's systems and a research roadmap for the next decade of data-agent development.