The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.