Decompilation aims to restore compiled code to human-readable source code, but it struggles to recover details such as names and structure. Large language models (LLMs) show promise on programming tasks, which motivates applying them to decompilation. However, no open-source LLM for decompilation currently exists. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. We therefore release the first open-access decompilation LLMs, ranging from 1B to 33B parameters and pre-trained on 4 billion tokens of C source code and the corresponding assembly code. These open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating decompilation models from the perspective of program semantics. Experiments indicate that LLM4Decompile accurately decompiles 21% of the assembly code, a 50% improvement over GPT-4. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile.
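The two metrics named above, re-compilability and re-executability, can be illustrated with a minimal sketch. It is not the Decompile-Eval harness itself: it assumes each decompiled output has already been assembled into a standalone C file whose `main()` runs assertion-style tests, and that a `gcc` binary is on the PATH. The names `gcc_check` and `score` are hypothetical.

```python
import os
import subprocess
import tempfile
from typing import Callable, Dict, Iterable, Tuple


def gcc_check(c_source: str, timeout: float = 10.0) -> Tuple[bool, bool]:
    """Return (recompiles, reexecutes) for one standalone C program.

    Assumption: the program carries its own main() with assertions, so a
    zero exit status means its tests passed. Requires `gcc` on PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "prog.c")
        exe = os.path.join(tmp, "prog")
        with open(src, "w") as f:
            f.write(c_source)
        try:
            built = subprocess.run(["gcc", src, "-o", exe, "-lm"],
                                   capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # Compiler missing or hung: count as not re-compilable.
            return False, False
        if built.returncode != 0:
            return False, False
        try:
            ran = subprocess.run([exe], capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # Compiled, but could not run to completion.
            return True, False
        return True, ran.returncode == 0


def score(programs: Iterable[str],
          check: Callable[[str], Tuple[bool, bool]] = gcc_check) -> Dict[str, float]:
    """Aggregate per-program checks into the two rates."""
    results = [check(p) for p in programs]
    n = len(results)
    if n == 0:
        return {"re_compilability": 0.0, "re_executability": 0.0}
    return {
        "re_compilability": sum(1 for c, _ in results if c) / n,
        # A program only counts as re-executable if it also compiled.
        "re_executability": sum(1 for c, e in results if c and e) / n,
    }
```

Passing `check` as a parameter keeps the aggregation separate from the compiler call, so the scoring logic can be exercised with a stub when no C toolchain is available.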