Large language models (LLMs) such as GitHub Copilot and ChatGPT have emerged as powerful tools for code generation, significantly enhancing productivity and accelerating software development. However, existing benchmarks focus primarily on general code generation and overlook API-oriented code generation, i.e., generating code that invokes APIs from specific libraries. Given the growing demand for API-oriented code generation, there is a pressing need for a systematic and automated approach to evaluate LLMs on this capability. To address this gap, we propose AutoAPIEval, a lightweight and automated framework for evaluating the capabilities of LLMs in API-oriented code generation. Our framework works with any library that provides API documentation and focuses on two unit tasks: API recommendation and code example generation. It defines four metrics to evaluate the generated APIs and code examples: the proportion of incorrect API recommendations for Task 1, and, for Task 2, the proportions of code examples that invoke no specific API, fail to compile, or fail to execute. In addition, we conducted a case study on three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) and Java Runtime Environment 8 to demonstrate the framework's effectiveness. Our findings reveal substantial variability in LLM performance across tasks: ChatGPT adheres better to instructions, while achieving effectiveness in code example generation similar to that of its counterparts (i.e., MagiCoder and DeepSeek Coder). We also identify key factors associated with code quality, such as API popularity and model confidence, and build classifiers that achieve high accuracy in detecting incorrect API recommendations and erroneous code examples. Finally, retrieval-augmented generation improves the quality of LLM-generated code, though its effectiveness varies across models.
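The four metrics can be sketched as simple proportions over recorded evaluation outcomes. The record fields, names, and toy data below are illustrative assumptions for exposition, not the framework's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Task1Result:
    recommended_api: str     # API name the model recommended
    in_documentation: bool   # whether it appears in the library's API documentation

@dataclass
class Task2Result:
    invokes_target_api: bool  # does the example actually call the requested API?
    compiles: bool
    executes: bool

def task1_incorrect_rate(results: list[Task1Result]) -> float:
    """Metric 1: proportion of recommended APIs absent from the documentation."""
    return sum(not r.in_documentation for r in results) / len(results)

def task2_metrics(results: list[Task2Result]) -> dict[str, float]:
    """Metrics 2-4: proportions of examples that miss the API,
    fail to compile, or compile but fail to execute."""
    n = len(results)
    return {
        "no_api_invoked": sum(not r.invokes_target_api for r in results) / n,
        "uncompilable":   sum(not r.compiles for r in results) / n,
        "unexecutable":   sum(r.compiles and not r.executes for r in results) / n,
    }

# Toy records (hypothetical outcomes, for illustration only).
t1 = [Task1Result("java.util.List.add", True),
      Task1Result("java.util.List.push", False)]
t2 = [Task2Result(True, True, True),
      Task2Result(True, True, False),
      Task2Result(False, False, False)]

print(task1_incorrect_rate(t1))  # → 0.5
print(task2_metrics(t2))
```

Lower values are better on all four metrics; each isolates one failure mode (hallucinated APIs, ignored instructions, compile errors, runtime errors) so the tasks can be diagnosed separately.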