Large language models (LLMs) have shown promising performance on code generation benchmarks. However, a considerable gap remains between these benchmark results and practical applicability, largely because real-world programming relies on pre-existing libraries. Instead of evaluating LLMs on writing code from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks. To this end, we introduce ML-Bench, an expansive benchmark developed to assess how effectively LLMs leverage existing functions in open-source libraries. ML-Bench consists of 10044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code to accomplish the task. This requires comprehending long documents that interleave natural language and code, as well as understanding complex cross-file code structures, which introduces new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it accomplishes only 39.73\% of the tasks, leaving substantial room for improvement. We address these challenges by proposing ML-Agent, which is designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at \url{https://ml-bench.github.io/}.