Benchmarking Large Language Models As AI Research Agents

Scientific experimentation is an iterative process of forming hypotheses, designing experiments, running them, and analyzing the results. Can we build AI research agents to perform these long-horizon tasks? As a step towards building and evaluating research agents on such open-ended decision-making tasks, we focus on the problem of machine learning engineering: given a task description and a dataset, build a high-performing model. In this paper, we propose MLAgentBench, a suite of ML tasks for benchmarking AI research agents. Agents can perform actions such as reading and writing files, executing code, and inspecting outputs. With these actions, agents can run experiments, analyze the results, and modify the code of entire machine learning pipelines, including data processing, model architecture, and training procedures. The benchmark then automatically and objectively evaluates each agent on a range of performance and efficiency metrics. We also design an LLM-based research agent that automatically performs experimentation loops in this environment. Empirically, we find that a GPT-4-based research agent can feasibly build compelling ML models on many tasks in MLAgentBench, displaying highly interpretable plans and actions. However, success rates vary considerably: they range from almost 90% on well-established older datasets to as low as 10% on recent Kaggle Challenges (unavailable during the LLM's pretraining), and even 0% on newer research challenges such as BabyLM. Finally, we identify several key challenges for LLM-based research agents, such as long-term planning and hallucination. Our code is released at https://github.com/snap-stanford/MLAgentBench.
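
To make the action interface described in the abstract concrete, the following is a minimal sketch of an environment that exposes file reading, file writing, and script execution to an agent, and records each action with its observation. It is an illustration only and not the MLAgentBench implementation: the `Workspace` class, its method names, the `step` dispatcher, and the placeholder `train.py` script are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class Workspace:
    """Hypothetical sandbox exposing file and execution actions to an agent."""

    def __init__(self, root):
        self.root = Path(root)
        self.log = []  # (action, first argument, observation) triples

    def read_file(self, name):
        # Observation is the file content.
        return (self.root / name).read_text()

    def write_file(self, name, content):
        (self.root / name).write_text(content)
        return f"wrote {len(content)} characters to {name}"

    def execute(self, name):
        # Run the script with the current interpreter inside the workspace
        # and return everything it printed, so the agent can inspect it.
        result = subprocess.run(
            [sys.executable, name],
            cwd=self.root,
            capture_output=True,
            text=True,
            timeout=60,
        )
        return result.stdout + result.stderr

    def step(self, action, *args):
        # Dispatch a named action and record it for later inspection.
        handlers = {
            "read": self.read_file,
            "write": self.write_file,
            "execute": self.execute,
        }
        observation = handlers[action](*args)
        self.log.append((action, args[0], observation))
        return observation


def run_demo():
    """One write/read/execute cycle on a placeholder script."""
    with tempfile.TemporaryDirectory() as tmp:
        ws = Workspace(tmp)
        ws.step("write", "train.py", "print('placeholder training run')\n")
        ws.step("read", "train.py")
        output = ws.step("execute", "train.py")
        return output.strip(), [action for action, _, _ in ws.log]
```

In a full agent, an LLM would choose the next action from the logged observations; here `run_demo()` hard-codes a single cycle so the sketch stays self-contained. The recorded log is the kind of trace that lets a reader audit an agent's plan step by step.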
