Data preparation, which aims to transform heterogeneous and noisy raw tables into analysis-ready data, remains a major bottleneck in data science. Recent approaches leverage large language models (LLMs) to automate data preparation from natural language specifications. However, existing LLM-powered methods either make decisions without grounding in intermediate execution results, or rely on linear interaction processes that offer limited support for revising earlier decisions. To address these limitations, we propose DeepPrep, an LLM-powered agentic system for autonomous data preparation (ADP). DeepPrep constructs data preparation pipelines through iterative, execution-grounded interaction with an environment that materializes intermediate table states and returns runtime feedback. To overcome the limitations of linear interaction, DeepPrep organizes pipeline construction with tree-based agentic reasoning, enabling structured exploration and non-local revision based on execution feedback. To enable effective learning of such behaviors, we propose a progressive agentic training framework, together with a data synthesis method that supplies diverse and complex ADP tasks. Extensive experiments show that DeepPrep achieves data preparation accuracy comparable to strong closed-source models (e.g., GPT-5) at 15x lower inference cost, establishes state-of-the-art performance among open-source baselines, and generalizes effectively across diverse datasets.