Binary Decompilation LLM with Feedback-Driven Multi-Turn Refinement

Binary decompilation is fundamental to security tasks such as vulnerability discovery, malware inspection, and executable-only program understanding. Recent LLM-based decompilation methods have shown promising results, but most still follow a single-turn generation paradigm: given assembly code or decompiler-produced pseudo-code, the model generates one output and stops. Consequently, the generated code may appear readable or even compile successfully, yet still deviate from the behavior of the original binary and mislead downstream analysis. This paper presents AutoDecompiler, a decompilation-specialized LLM trained with reinforcement learning for feedback-driven multi-turn binary decompilation. Instead of treating decompilation as one-shot code generation, AutoDecompiler formulates it as an iterative refinement process, where the model revises generated code based on compilation, execution, and input/output testing feedback. To enable this process, we design decompilation-specific rewards that capture code validity, recompilability, execution consistency, and semantic fidelity. We further construct stage-aware diagnostic feedback from compiler errors, execution failures, and failed test cases, and introduce progress-aware trajectory rewarding and turn-aware advantage reweighting to encourage beneficial revisions while suppressing regressions. We train the AutoDecompiler family and evaluate it across different input settings, model scales, and benchmarks. Experimental results show that AutoDecompiler consistently outperforms its single-turn counterparts under the same model size and input setting, achieving clear improvements in behavioral re-executability. These results demonstrate that learning to exploit program feedback with reinforcement learning is an effective direction for improving the functional correctness of LLM-based binary decompilation.
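To make the refinement signal concrete, the following is a minimal sketch of how staged outcomes (compilation, execution, input/output testing) might be mapped to a per-turn reward, a diagnostic message, and a progress-aware shaped reward across turns. It is not AutoDecompiler's actual reward design: the names `Stage`, `Outcome`, `turn_reward`, `feedback`, and `progress_rewards`, along with every numeric weight, are hypothetical and chosen only for illustration. Turn-aware advantage reweighting is omitted.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    """Furthest stage a candidate reached; the ordering encodes progress."""
    INVALID = 0       # no usable code in the model output
    COMPILE_FAIL = 1  # code extracted, but the compiler rejected it
    RUNTIME_FAIL = 2  # compiled, but crashed or timed out when executed
    TEST_FAIL = 3     # ran, but some input/output tests mismatched
    PASS = 4          # all input/output tests matched


@dataclass
class Outcome:
    stage: Stage
    detail: str = ""      # compiler message, crash reason, or failing test
    tests_passed: int = 0
    tests_total: int = 0


def turn_reward(o: Outcome) -> float:
    """Hypothetical staged reward in [0, 1]; the weights are illustrative."""
    base = {
        Stage.INVALID: 0.0,
        Stage.COMPILE_FAIL: 0.1,
        Stage.RUNTIME_FAIL: 0.3,
        Stage.TEST_FAIL: 0.5,
        Stage.PASS: 1.0,
    }[o.stage]
    if o.stage is Stage.TEST_FAIL and o.tests_total > 0:
        # Partial credit for the fraction of tests passed, capped below PASS.
        base += 0.4 * o.tests_passed / o.tests_total
    return base


def feedback(o: Outcome) -> str:
    """Stage-aware diagnostic text to prepend to the next turn's prompt."""
    if o.stage is Stage.PASS:
        return ""
    hints = {
        Stage.INVALID: "No code block was found in the previous answer.",
        Stage.COMPILE_FAIL: "The code did not compile.",
        Stage.RUNTIME_FAIL: "The code compiled but failed during execution.",
        Stage.TEST_FAIL: "The code ran but produced wrong outputs.",
    }
    return f"[{o.stage.name}] {hints[o.stage]} {o.detail}".strip()


def progress_rewards(rewards: List[float],
                     bonus: float = 0.5,
                     penalty: float = 0.5) -> List[float]:
    """Shape per-turn rewards by the change relative to the previous turn.

    Improvements add bonus * delta; regressions subtract penalty * |delta|.
    The first turn has no predecessor and is left unshaped.
    """
    shaped: List[float] = []
    prev: Optional[float] = None
    for r in rewards:
        if prev is None:
            shaped.append(r)
        else:
            delta = r - prev
            shaped.append(r + (bonus if delta > 0 else penalty) * delta)
        prev = r
    return shaped
```

In this sketch, a trajectory that goes from a compile failure to a partially correct program and then regresses to a runtime failure has its second turn boosted and its third turn penalized, which is one simple way to favor beneficial revisions over regressions.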
