A central aspect of machine learning research is experimentation: the process of designing and running experiments, analyzing the results, and iterating towards some positive outcome (e.g., improving accuracy). Could agents driven by powerful language models perform machine learning experimentation effectively? To answer this question, we introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions such as reading and writing files, executing code, and inspecting outputs. We then construct an agent, based on the ReAct framework, that can perform ML experimentation. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral, and find that the Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models across many tasks in MLAgentBench, with a 37.5% average success rate. Our agents also display highly interpretable plans and actions. However, success rates vary considerably: they range from 100% on well-established older datasets to as low as 0% on recent Kaggle challenges that were potentially created after the underlying LM was trained. Finally, we identify several key challenges for LM-based agents, such as long-term planning and reducing hallucination. Our code is released at https://github.com/snap-stanford/MLAgentBench.
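To make the agent setup described above more concrete, the following is a minimal sketch of a ReAct-style loop in which each step pairs a thought with an action and records the resulting observation. It is not the MLAgentBench implementation: the function names (`read_file`, `write_file`, `execute_script`, `run_agent`), the action dictionary, and the `scripted_policy` that stands in for a language model are all hypothetical and chosen only for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


# Hypothetical action set mirroring the actions named in the abstract
# (reading/writing files and executing code). Not the benchmark's real API.
def read_file(workdir, path):
    return (Path(workdir) / path).read_text()


def write_file(workdir, path, content):
    (Path(workdir) / path).write_text(content)
    return f"wrote {len(content)} characters to {path}"


def execute_script(workdir, path):
    # Run the script with the current interpreter and return its output
    # so the agent can inspect it as the next observation.
    result = subprocess.run(
        [sys.executable, path],
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=60,
    )
    return result.stdout + result.stderr


ACTIONS = {
    "read_file": read_file,
    "write_file": write_file,
    "execute_script": execute_script,
}


def run_agent(policy, workdir, max_steps=10):
    """Alternate thought -> action -> observation until a final answer."""
    history = []
    for _ in range(max_steps):
        thought, action, args = policy(history)
        if action == "final_answer":
            history.append((thought, action, args, "done"))
            break
        observation = ACTIONS[action](workdir, **args)
        history.append((thought, action, args, observation))
    return history


def scripted_policy(history):
    # Assumption: a fixed script replaces the language model so that the
    # sketch runs offline. A real agent would prompt an LM with the history.
    steps = [
        ("Create a training script.", "write_file",
         {"path": "train.py", "content": "print('accuracy: 0.9')\n"}),
        ("Run it and inspect the output.", "execute_script",
         {"path": "train.py"}),
        ("The script ran; stop here.", "final_answer", {}),
    ]
    return steps[len(history)]


def demo():
    # Run the scripted agent inside a throwaway working directory.
    with tempfile.TemporaryDirectory() as workdir:
        return run_agent(scripted_policy, workdir)
```

Calling `demo()` returns the full list of (thought, action, arguments, observation) tuples. Keeping this history explicit is one simple way to obtain the kind of inspectable plans and actions the abstract refers to, although the actual agent may record its trajectory differently.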