In this work, we introduce CodeRepoQA, a large-scale benchmark specifically designed to evaluate repository-level question-answering capabilities in software engineering. CodeRepoQA encompasses five programming languages and covers a wide range of scenarios, enabling comprehensive evaluation of language models. To construct the dataset, we crawl data from 30 well-known repositories on GitHub, the largest platform for hosting and collaborating on code, and carefully filter the raw data. In total, CodeRepoQA is a multi-turn question-answering benchmark with 585,687 entries covering a diverse array of software engineering scenarios, with an average of 6.62 dialogue turns per entry. We evaluate ten popular large language models on our dataset and provide an in-depth analysis. We find that LLMs still have limited question-answering capabilities in software engineering, and that medium-length contexts are more conducive to their performance. The entire benchmark is publicly available at https://github.com/kinesiatricssxilm14/CodeRepoQA.