Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation because their parameters are frozen. Reinforcement Learning (RL) offers a remedy, but applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. To address these challenges, we propose AceGRPO, which has two core components: (1) an Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Trained with AceGRPO, our Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.
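To make the two components concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not define the Learnability Potential function, so the sketch substitutes an assumed stand-in: the variance p(1 - p) of a smoothed success rate p, which is largest for tasks the agent solves about half the time. The names `Task`, `EvolvingBuffer`, `learnability`, `add_from_trace`, and `sample` are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Task:
    """A training task plus running statistics of the agent's attempts."""
    task_id: str
    successes: int = 0
    attempts: int = 0


def learnability(task: Task, prior: float = 1.0) -> float:
    """Assumed stand-in for a learnability potential (not from the paper).

    Uses a Laplace-smoothed success rate p and returns p * (1 - p), so
    tasks that are always solved or never solved get a low score, while
    untried or half-solved tasks get the highest score.
    """
    p = (task.successes + prior) / (task.attempts + 2.0 * prior)
    return p * (1.0 - p)


@dataclass
class EvolvingBuffer:
    """Hypothetical buffer that grows as execution traces are repurposed."""
    tasks: list = field(default_factory=list)

    def add_from_trace(self, parent_id: str, step: int) -> Task:
        # Turn an intermediate state of an execution trace into a new task.
        task = Task(task_id=f"{parent_id}@step{step}")
        self.tasks.append(task)
        return task

    def sample(self, k: int, rng: random.Random) -> list:
        # Draw k tasks with probability proportional to learnability.
        weights = [learnability(t) for t in self.tasks]
        return rng.choices(self.tasks, weights=weights, k=k)


if __name__ == "__main__":
    buffer = EvolvingBuffer()
    easy = buffer.add_from_trace("taskA", 0)
    easy.successes, easy.attempts = 10, 10      # always solved
    frontier = buffer.add_from_trace("taskA", 3)
    frontier.successes, frontier.attempts = 5, 10  # solved half the time
    batch = buffer.sample(k=4, rng=random.Random(0))
    print([t.task_id for t in batch])
```

Under this assumed scoring, sampling concentrates on tasks whose outcomes are still uncertain, which is one plausible reading of "tasks at the agent's learning frontier"; the actual function used by AceGRPO may differ.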