LLM4Decompile: Decompiling Binary Code with Large Language Models

Decompilation aims to restore compiled code to human-readable source code, but it struggles to recover details such as names and structure. Large language models (LLMs) show promise for programming tasks, which motivates applying them to decompilation. However, no open-source LLM for decompilation exists. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, the most important feature of any program. We therefore release the first open-access decompilation LLMs, ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code. These open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes evaluating decompilation models from the perspective of program semantics. Experiments indicate that LLM4Decompile accurately decompiles 21% of the assembly code, a 50% improvement over GPT-4. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
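To make the two metrics concrete, here is a minimal sketch of how re-compilability and re-executability could be measured. It is not the Decompile-Eval implementation: the function names, the use of `gcc`, and the harness format (a C `main()` whose assertions exit with code 0 on success) are all assumptions made for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def compile_and_run(c_source: str, test_harness: str, cc: str = "gcc",
                    timeout: float = 10.0) -> Tuple[bool, bool]:
    """Return (recompiled, reexecuted) for one decompiled function.

    Assumption: `test_harness` is C code with a main() whose assertions
    exercise the function; exit code 0 means every assertion passed.
    """
    if shutil.which(cc) is None:
        raise RuntimeError(f"compiler {cc!r} not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        exe = Path(tmp) / "candidate"
        src.write_text(c_source + "\n" + test_harness)
        try:
            build = subprocess.run([cc, str(src), "-o", str(exe), "-lm"],
                                   capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False, False
        if build.returncode != 0:
            return False, False  # does not even compile
        try:
            run = subprocess.run([str(exe)], capture_output=True,
                                 timeout=timeout)
        except subprocess.TimeoutExpired:
            return True, False  # compiles, but hangs
        return True, run.returncode == 0


def score(samples: Iterable[Tuple[str, str]],
          check: Callable[[str, str], Tuple[bool, bool]] = compile_and_run
          ) -> Dict[str, float]:
    """Aggregate per-sample results into the two rates."""
    results = [check(source, harness) for source, harness in samples]
    n = len(results)
    if n == 0:
        return {"re_compilability": 0.0, "re_executability": 0.0}
    return {
        "re_compilability": sum(c for c, _ in results) / n,
        "re_executability": sum(e for _, e in results) / n,
    }
```

Under this sketch, a sample counts toward re-executability only if it also compiles, so the second rate can never exceed the first. The `check` parameter is injectable so that the aggregation can be exercised without a compiler installed.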

