The fields of code and natural language processing are evolving rapidly. In particular, models are becoming better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, and the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and a link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.