Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B parameters) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binaries directly. The resulting models outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering substantial improvements in readability and executability while complementing conventional tools for optimal results. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile