ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein

Large language models (LLMs) have shown promising performance on code generation benchmarks. However, a considerable gap remains between these benchmark results and practical applicability, largely because real-world programming relies on pre-existing libraries. Instead of evaluating LLMs on writing code from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks. To this end, we propose ML-Bench, an expansive benchmark developed to assess how effectively LLMs leverage existing functions in open-source libraries. It consists of 10,044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code to accomplish the task. This requires comprehending long documents that interleave language and code, as well as understanding complex cross-file code structures, which introduces new challenges. Notably, while GPT-4 shows a remarkable improvement over other LLMs, it accomplishes only 39.73% of the tasks, leaving substantial room for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at https://ml-bench.github.io/.
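The evaluation setting described in the abstract (a task instruction plus a README in, generated code out, scored by the share of tasks accomplished) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not specify the prompt format, the success criterion, or any API, so `Sample`, `build_prompt`, `success_rate`, and the `generate` and `is_success` callables are hypothetical and are not the benchmark's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Sample:
    """One hypothetical benchmark sample: a task instruction and a repository README."""
    instruction: str
    readme: str


def build_prompt(sample: Sample) -> str:
    # Assumed prompt layout; the real benchmark may format its inputs differently.
    return (
        "README:\n" + sample.readme
        + "\n\nTask:\n" + sample.instruction
        + "\n\nWrite the code to accomplish the task."
    )


def success_rate(
    samples: Sequence[Sample],
    generate: Callable[[str], str],
    is_success: Callable[[Sample, str], bool],
) -> float:
    """Return the percentage of samples whose generated code is judged successful.

    `generate` stands in for an LLM call and `is_success` for whatever
    check decides that a task was accomplished; both are placeholders.
    """
    if not samples:
        return 0.0
    passed = sum(1 for s in samples if is_success(s, generate(build_prompt(s))))
    return 100.0 * passed / len(samples)
```

With a real model behind `generate` and an execution-based check behind `is_success`, a figure such as the 39.73% reported for GPT-4 would correspond to the value returned by a function like `success_rate` over the benchmark's samples.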

