SWE-QA: Can Language Models Answer Repository-level Code Questions?

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA comprises 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.
