SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

Source: arXiv. Accepted to the EQUISA (Evaluation of Qualitative Aspects of Intelligent Software Assistants) workshop at EASE (Evaluation and Assessment in Software Engineering) 2026.

Instructed code editing remains a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction-driven editing under executable test constraints. To address this, we propose SAFEdit, a multi-agent framework for instructed code editing that decomposes the editing process into specialized roles to improve reliability and reduce unintended code changes. A Planner Agent produces an explicit, visibility-aware edit plan, an Editor Agent applies minimal, literal code modifications, and a Verifier Agent executes real test runs. When tests fail, a Failure Abstraction Layer (FAL) transforms the raw test logs into structured diagnostic feedback, which is returned to the Editor to support iterative refinement. We compare SAFEdit against both prior single-model results reported for EditBench and a ReAct single-agent baseline that we implemented under the same evaluation conditions. We evaluated SAFEdit on 445 EditBench code editing instances spanning five natural languages (English, Polish, Spanish, Chinese, and Russian) and several spatial context variants. SAFEdit achieved 68.6 percent TSR, outperforming the single-model baseline by 3.8 percentage points and the ReAct single-agent baseline by 8.6 percentage points. The iterative refinement loop contributed 17.4 percentage points to SAFEdit's overall success rate. SAFEdit's automated error analysis further indicates fewer instruction-level hallucinations than single-agent approaches, and it gives the framework a component for interpreting failures beyond pass or fail outcomes.
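The plan, edit, verify, and refine cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Diagnostic`, `abstract_failures`, and `run_edit_loop`, the callable signatures, the `max_rounds` budget, and the choice to pass the previous candidate back to the editor are all assumptions made for illustration. The toy failure abstraction only parses lines shaped like pytest's short summary (`FAILED <test> - <ErrorType>: <message>`); the abstract does not specify what the real FAL extracts.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Diagnostic:
    """Structured summary of one failing test (hypothetical schema)."""
    test_name: str
    error_type: str
    message: str


def abstract_failures(raw_log: str) -> List[Diagnostic]:
    """Toy stand-in for a failure abstraction layer.

    Assumes failures appear as 'FAILED <test> - <ErrorType>: <message>',
    one per line, and ignores everything else in the log.
    """
    pattern = re.compile(r"^FAILED (\S+) - (\w+): (.*)$", re.MULTILINE)
    return [Diagnostic(*m.groups()) for m in pattern.finditer(raw_log)]


def run_edit_loop(
    source: str,
    instruction: str,
    planner: Callable[[str, str], str],
    editor: Callable[[str, str, List[Diagnostic]], str],
    verifier: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,  # assumed budget; not taken from the paper
) -> Tuple[Optional[str], int]:
    """Plan once, then alternate editing and verification.

    Returns (final_code, rounds_used); final_code is None if no
    candidate passed within max_rounds.
    """
    plan = planner(source, instruction)
    feedback: List[Diagnostic] = []
    current = source
    for round_no in range(1, max_rounds + 1):
        # The editor sees the plan plus diagnostics from the last failed run.
        current = editor(current, plan, feedback)
        passed, raw_log = verifier(current)
        if passed:
            return current, round_no
        # Condense the raw log into structured feedback for the next round.
        feedback = abstract_failures(raw_log)
    return None, max_rounds
```

In this sketch the three roles are plain callables, so each could be backed by an LLM call or by a test stub. Keeping the verifier's raw log out of the editor's input and passing only the parsed `Diagnostic` records mirrors the separation the abstract describes between test execution and diagnostic feedback.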
