Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset, constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. Our analysis also reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as reliable tools for repository-scale program comprehension.
