Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., determining the right arguments for invoking routines), which requires a deeper comprehension of complex file interactions. Recently, LLM agents have been developed that attempt to interact with repository code (e.g., compiling it and evaluating its execution), prompting the need to assess their performance. These gaps motivate our development of ML-Bench, a benchmark rooted in real-world programming applications that leverages existing code repositories to perform tasks. To address the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench comprises 9,641 annotated examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, we employ two setups: ML-LLM-Bench, which assesses LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench, which tests autonomous agents on end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate above 50%, there remains significant room for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at https://github.com/gersteinlab/ML-bench.