MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry

From arXiv: 10 pages, plus a 17-page appendix. The first seven authors contributed equally and are listed in randomized order. Footnote 4 corrected.

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
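
The headline metric above, the share of competitions in which an agent reaches at least bronze-medal level, can be illustrated with a minimal sketch. This is not the benchmark's actual grading code: the `CompetitionResult` type, its field names, the competition names, and all scores and thresholds below are hypothetical, and the sketch simply assumes that a bronze threshold has already been derived from each competition's leaderboard.

```python
from dataclasses import dataclass


@dataclass
class CompetitionResult:
    """One agent attempt on one competition (all fields are illustrative)."""
    name: str
    agent_score: float
    bronze_threshold: float  # assumed: leaderboard score needed for bronze
    higher_is_better: bool   # False for error metrics such as RMSE


def earns_bronze_or_better(result: CompetitionResult) -> bool:
    """Return True if the agent's score meets the bronze threshold."""
    if result.higher_is_better:
        return result.agent_score >= result.bronze_threshold
    return result.agent_score <= result.bronze_threshold


def medal_rate(results: list[CompetitionResult]) -> float:
    """Fraction of competitions with at least a bronze-level score."""
    if not results:
        raise ValueError("need at least one competition result")
    medals = sum(earns_bronze_or_better(r) for r in results)
    return medals / len(results)


# Made-up example data, not results from the paper.
results = [
    CompetitionResult("comp-a", 0.91, 0.90, True),
    CompetitionResult("comp-b", 0.42, 0.35, False),
    CompetitionResult("comp-c", 0.77, 0.80, True),
    CompetitionResult("comp-d", 0.10, 0.12, False),
]
print(f"{medal_rate(results):.1%}")
```

The `higher_is_better` flag is there because competitions differ in whether their metric rewards larger values (accuracy) or smaller ones (error), so a single comparison direction would misgrade some of them.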
