While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models, and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks: basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While retrieving high-quality contexts yields notable gains in final code generation across various settings, our analysis reveals room for improvement: current retrievers still struggle to fetch useful contexts, especially when lexical overlap is limited, and generators fail to benefit when their context lengths are limited or their ability to integrate additional contexts is weak. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
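The retrieval-augmented code generation setup described above can be sketched in a few lines: retrieve the top-scoring documents for a coding task, then prepend them to the generation prompt. This is a minimal illustration, not the benchmark's actual pipeline; the `corpus` entries and the simple lexical-overlap scorer are hypothetical stand-ins for the five document sources and the BM25-style retrievers evaluated in the paper. Note how the lexical scorer only surfaces documents sharing surface tokens with the query, which illustrates the limited-lexical-overlap failure mode noted in the analysis.

```python
from collections import Counter

def lexical_score(query: str, doc: str) -> int:
    """Token-overlap score: a crude stand-in for BM25-style lexical retrieval."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the top-k documents by lexical overlap with the query."""
    ranked = sorted(corpus, key=lambda doc: lexical_score(query, doc), reverse=True)
    return ranked[:k]

def build_prompt(task: str, contexts: list[str]) -> str:
    """Prepend retrieved contexts to the code-generation instruction."""
    ctx = "\n\n".join(f"# Context {i + 1}:\n{c}" for i, c in enumerate(contexts))
    return f"{ctx}\n\n# Task:\n{task}\n# Solution:\n"

# Hypothetical mini-corpus mixing documentation snippets and tutorial text.
corpus = [
    "pandas.DataFrame.merge: merge DataFrame objects with a database-style join",
    "numpy.argsort returns the indices that would sort an array",
    "How to reverse a list in Python using slicing: lst[::-1]",
]
task = "Merge two pandas DataFrame objects on a shared key column."
contexts = retrieve(task, corpus, k=1)
prompt = build_prompt(task, contexts)  # this prompt would be passed to the code LM
print(prompt)
```

In a full system, the retrieved contexts would come from a dense or sparse retriever over the document sources, and `prompt` would be fed to a code generation model; a query with little lexical overlap with its relevant document (e.g. describing a join without the word "merge") would defeat this scorer entirely.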