Collaborative Evolving Strategy for Automatic Data-Centric Development

Artificial Intelligence (AI) significantly influences many fields, largely thanks to the vast amounts of high-quality data available to machine learning models. Emphasis has accordingly shifted toward a data-centric AI strategy, which prioritizes data development over progress in model design. Automating this process is crucial. In this paper, we are the first to introduce the automatic data-centric development (AD^2) task and outline its core challenges, which require task scheduling and implementation capabilities comparable to those of domain experts and remain largely unexplored in previous work. Leveraging the strong complex-problem-solving capabilities of large language models (LLMs), we propose an LLM-based autonomous agent equipped with a strategy named Collaborative Knowledge-STudying-Enhanced Evolution by Retrieval (Co-STEER) to address all of these challenges simultaneously. Specifically, the Co-STEER agent enriches its domain knowledge through the proposed evolving strategy and develops both its scheduling and implementation skills by accumulating and retrieving domain-specific practical experience. A better schedule lets implementation ability improve faster; in turn, more thorough implementation feedback makes scheduling more accurate. The two capabilities thus evolve together through practical feedback, forming a collaborative evolution process. Extensive experiments demonstrate that the Co-STEER agent breaks new ground in AD^2 research, that it possesses strong, evolvable scheduling and implementation abilities, and that its components contribute significantly to its effectiveness. Co-STEER paves the way for AD^2 advancements.
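
The feedback loop described above can be illustrated with a toy sketch. This is not the Co-STEER implementation: the task names, the `KnowledgeBase` class, and the functions `schedule`, `implement`, and `co_evolve` are all hypothetical, and no LLM is involved. It only shows how a shared store of practical experience could let scheduling and implementation improve each other, assuming that a task can be implemented only once its (initially unknown) prerequisites have solved examples to retrieve.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical experience store: a log of (task, succeeded) records."""
    records: list = field(default_factory=list)

    def add(self, task, succeeded):
        self.records.append((task, succeeded))

    def successes(self):
        # Tasks for which a working implementation can be retrieved.
        return {task for task, ok in self.records if ok}

    def failures(self, task):
        # How often this task has failed so far.
        return sum(1 for t, ok in self.records if t == task and not ok)


def schedule(pending, kb):
    """Toy scheduler: prefer the task with the fewest recorded failures.

    It does not know the prerequisites; it learns only from feedback.
    Ties are broken alphabetically to keep the sketch deterministic.
    """
    return min(pending, key=lambda task: (kb.failures(task), task))


def implement(task, prerequisites, kb):
    """Toy implementer: succeeds only if every prerequisite is already solved."""
    return prerequisites[task] <= kb.successes()


def co_evolve(prerequisites, budget):
    """Alternate scheduling and implementation, sharing one knowledge base."""
    kb = KnowledgeBase()
    pending = set(prerequisites)
    order = []
    for _ in range(budget):
        if not pending:
            break
        task = schedule(pending, kb)
        ok = implement(task, prerequisites, kb)
        kb.add(task, ok)  # feedback is visible to both roles
        if ok:
            pending.remove(task)
            order.append(task)
    return order, kb


# Hypothetical data-development tasks and their hidden dependencies.
prerequisites = {"load": set(), "clean": {"load"}, "featurize": {"clean"}}
order, kb = co_evolve(prerequisites, budget=10)
print(order)            # ['load', 'clean', 'featurize']
print(len(kb.records))  # 5 attempts: 2 early failures, then 3 successes
```

In this toy run the scheduler first tries `clean` and `featurize` and fails; the recorded failures steer it to `load`, and each success then makes the next implementation feasible. The two roles improve only through the experience they share, which is the coupling the abstract describes in general terms.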
