Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents

The Rust programming language has a steep learning curve and poses significant coding challenges, making the automation of issue resolution essential for its broader adoption. LLM-powered code agents have recently shown remarkable success in resolving complex software engineering tasks, yet their application to Rust has been limited by the absence of a large-scale, repository-level benchmark. To bridge this gap, we introduce Rust-SWE-bench, a benchmark comprising 500 real-world, repository-level software engineering tasks from 34 diverse and popular Rust repositories. We then conduct a comprehensive study on Rust-SWE-bench with four representative agents and four state-of-the-art LLMs to establish a foundational understanding of their capabilities and limitations in the Rust ecosystem. The study reveals that ReAct-style agents are promising, resolving up to 21.2% of issues, but are limited by two primary challenges: comprehending repository-wide code structure and complying with Rust's strict type and trait semantics. We also find that issue reproduction is critical for task resolution. Motivated by these findings, we propose RUSTFORGER, a novel agentic approach that integrates automated test environment setup with a Rust metaprogramming-driven dynamic tracing strategy to facilitate reliable issue reproduction and dynamic analysis. Our evaluation shows that RUSTFORGER with Claude-Sonnet-3.7 significantly outperforms all baselines, resolving 28.6% of tasks on Rust-SWE-bench, a 34.9% relative improvement over the strongest baseline. In aggregate, across all evaluated LLMs, it uniquely solves 46 tasks that no other agent could solve.
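
The 34.9% figure is a relative improvement, not a difference in percentage points. The short check below assumes that the strongest baseline is the 21.2% resolution rate quoted above and that both rates are computed over all 500 tasks; the task counts are implied by those rates and are not stated in the abstract.

```latex
% Relative improvement of RUSTFORGER (28.6%) over the assumed strongest baseline (21.2%)
\[
\frac{28.6 - 21.2}{21.2} = \frac{7.4}{21.2} \approx 0.349 \quad (34.9\%)
\]
% The same ratio in implied task counts, assuming rates over all 500 tasks
\[
0.286 \times 500 = 143, \qquad 0.212 \times 500 = 106, \qquad
\frac{143 - 106}{106} = \frac{37}{106} \approx 0.349
\]
```

In absolute terms, the gap is 7.4 percentage points, or 37 additional tasks under these assumptions.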
