While large language models (LLMs) have shown considerable promise in code generation, real-world software development demands advanced repository-level reasoning: understanding dependencies and project structures, and managing multi-file changes. However, the ability of LLMs to effectively comprehend and handle complex code repositories has yet to be fully explored. To address these challenges, we introduce DependEval, a hierarchical benchmark designed to evaluate repository dependency understanding. The benchmark is built from 15,576 repositories collected from real-world websites and evaluates models on three core tasks, Dependency Recognition, Repository Construction, and Multi-file Editing, across 8 programming languages drawn from actual code repositories. Our evaluation of over 25 LLMs reveals substantial performance gaps and provides valuable insights into repository-level code understanding.