We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work such as AI Scientist, which evaluates the end-to-end agentic pipeline using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between the baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark designed to grow with new ML competitions, encouraging rigorous and objective evaluation of AI research capabilities. Our leaderboard and code are available at: https://huggingface.co/spaces/launch/MLRC_Bench
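As a concrete reading of the 9.3% figure, the reported gap-closure score can be understood as the agent's relative improvement toward the top human result. A plausible formulation (our notation, offered as a sketch rather than the benchmark's exact definition) is

\[
\text{Gap closed} \;=\; \frac{s_{\text{agent}} - s_{\text{baseline}}}{s_{\text{human}} - s_{\text{baseline}}} \times 100\%,
\]

where \(s_{\text{agent}}\), \(s_{\text{baseline}}\), and \(s_{\text{human}}\) denote the test-metric scores of the evaluated agent, the provided competition baseline, and the top human participant, respectively.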