DataMaster: Towards Autonomous Data Engineering for Machine Learning

Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei, Zimeng Chen, Fenyi Liu, Haotian Wu, Yuzhu Cai, Zexi Liu, Xinyu Zhu, WenHao Wang, Linfeng Zhang, Chen Qian, Siheng Chen

As model families, training recipes, and compute budgets become more standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, in which an autonomous agent improves a fixed learning algorithm by optimizing only the data side: external data discovery, data selection and composition, and cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).
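
The interplay of the three components can be illustrated with a small toy sketch. This is not the authors' implementation: the class names `DataNode`, `DataPool`, and `GlobalMemory`, the `search` function, the greedy best-first expansion rule, and the additive toy scorer are all assumptions made for illustration. In particular, `evaluate` stands in for downstream training and validation, which the abstract describes only at a high level.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DataNode:
    """One node of the tree: a recipe (ordered data sources) and its score."""
    recipe: tuple[str, ...]
    parent: Optional["DataNode"] = None
    score: Optional[float] = None
    children: list["DataNode"] = field(default_factory=list)


@dataclass
class DataPool:
    """Shared store of discovered external sources, visible to every branch."""
    sources: set[str] = field(default_factory=set)


@dataclass
class GlobalMemory:
    """Cumulative log of (recipe, score) outcomes across all branches."""
    records: list[tuple[tuple[str, ...], float]] = field(default_factory=list)

    def record(self, node: DataNode) -> None:
        self.records.append((node.recipe, node.score))

    def tried(self, recipe: tuple[str, ...]) -> bool:
        return any(r == recipe for r, _ in self.records)


def search(
    root: DataNode,
    pool: DataPool,
    memory: GlobalMemory,
    evaluate: Callable[[tuple[str, ...]], float],
    steps: int,
) -> DataNode:
    """Greedy best-first search (an assumed policy, not the paper's).

    Each step extends the best-scoring node with every pooled source it does
    not already use, skipping recipes that memory has already seen.
    """
    root.score = evaluate(root.recipe)
    memory.record(root)
    nodes = [root]
    for _ in range(steps):
        best = max(nodes, key=lambda n: n.score)
        expanded = False
        for source in sorted(pool.sources):
            recipe = best.recipe + (source,)
            if source in best.recipe or memory.tried(recipe):
                continue
            child = DataNode(recipe, parent=best, score=evaluate(recipe))
            best.children.append(child)
            memory.record(child)
            nodes.append(child)
            expanded = True
        if not expanded:  # the best node has no untried extension left
            break
    return max(nodes, key=lambda n: n.score)


# Toy scorer (hypothetical): each source adds or subtracts a fixed amount.
gains = {"a": 0.1, "b": -0.05, "c": 0.2}
memory = GlobalMemory()
best = search(
    DataNode(recipe=()),
    DataPool(sources=set(gains)),
    memory,
    evaluate=lambda recipe: 0.5 + sum(gains[s] for s in recipe),
    steps=3,
)
print(best.recipe, len(memory.records))
```

In this toy run the search keeps the two helpful sources and drops the harmful one, while the memory prevents any recipe from being evaluated twice. A real system would replace `evaluate` with an actual training-and-validation run, which is where the delayed-validation cost mentioned in the abstract arises.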

