CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering

Large language models (LLMs) that support software development tasks such as code generation, code completion, and code question answering (QA) have been studied extensively in both academia and industry, and they are now integrated into popular intelligent IDEs such as JetBrains and Cursor. Current benchmarks for evaluating models' code comprehension capabilities focus primarily on code generation or completion and often neglect QA, a crucial aspect of understanding code. Existing code QA benchmarks are either derived from code comments with predefined patterns (e.g., CodeQA) or restricted to specific domains such as education (e.g., CS1QA). These benchmarks fail to capture the real-world complexity of software engineering and the needs of users trying to understand code repositories. To address this gap, we introduce CoReQA, a benchmark for Code Repository-level question answering constructed from GitHub issues and comments from 176 popular repositories across four programming languages. Because questions and answers may contain both natural language and code snippets, traditional evaluation metrics such as BLEU are inadequate for assessing repository-level QA performance. We therefore provide an LLM-as-a-judge framework that evaluates QA performance along five aspects. Using CoReQA, we evaluate three baselines: two short-context models that use generic retrieval strategies and one long-context model that takes the entire repository as context. The results show that state-of-the-art proprietary and long-context models struggle to answer repository-level questions effectively. Our analysis highlights the limitations of language models in helping developers understand repositories and suggests future directions for improving repository comprehension systems through effective context retrieval methodologies.
