Decompilation aims to recover the source-code form of a binary executable. It has many applications in security and software engineering, such as malware analysis, vulnerability detection, and code reuse. A prominent challenge in decompilation is recovering variable names. We propose a novel method that leverages the synergy of large language models (LLMs) and program analysis. Language models encode rich multi-modal knowledge, but their limited input size prevents providing sufficient global context for name recovery. We propose to divide the task into many LLM queries and to use program analysis to correlate and propagate the query results, which in turn improves the performance of the LLM by providing additional contextual information. Our results show that 75% of the recovered names are considered good by users, and our technique outperforms the state-of-the-art technique by 16.5% and 20.23% in precision and recall, respectively.
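The query-and-propagate idea can be illustrated with a minimal sketch. It is not the implementation evaluated here: the per-function query granularity, the use of call-graph edges as the only form of program analysis, the fixed-point loop, and every name in the code (`recover_names`, `query_model`, `toy_model`, the demo data) are assumptions made for illustration. The model is passed in as a plain callable, and `toy_model` is a hard-coded stand-in, not a real LLM.

```python
from typing import Callable, Dict, List

Functions = Dict[str, str]        # function id -> decompiled body (hypothetical format)
CallGraph = Dict[str, List[str]]  # function id -> ids of the functions it calls
Names = Dict[str, str]            # placeholder variable -> predicted name


def recover_names(
    functions: Functions,
    call_graph: CallGraph,
    query_model: Callable[[str, Names], Names],
    max_rounds: int = 3,
) -> Dict[str, Names]:
    """Query the model once per function per round, passing names already
    predicted for callers and callees as additional context."""
    # Reverse edges so that context flows from callers as well as callees.
    callers: Dict[str, List[str]] = {f: [] for f in functions}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    names: Dict[str, Names] = {f: {} for f in functions}
    for _ in range(max_rounds):
        changed = False
        for func_id, body in functions.items():
            neighbours = call_graph.get(func_id, []) + callers.get(func_id, [])
            # Collect the names predicted so far in neighbouring functions.
            context: Names = {}
            for n in neighbours:
                for var, name in names.get(n, {}).items():
                    context[f"{n}.{var}"] = name
            predicted = query_model(body, context)
            if predicted != names[func_id]:
                names[func_id] = predicted
                changed = True
        if not changed:  # no prediction changed in this round, so stop early
            break
    return names


def toy_model(body: str, context: Names) -> Names:
    """Stand-in for an LLM query; the rules below are hard-coded for the demo."""
    if "read" in body:
        return {"a1": "input_buffer"}
    if "helper.a1" in context:
        # Reuse the name recovered for the callee's parameter.
        return {"v2": context["helper.a1"]}
    return {}


demo_functions = {"main": "v1 = helper(v2)", "helper": "read(a1)"}
demo_call_graph = {"main": ["helper"], "helper": []}
demo_result = recover_names(demo_functions, demo_call_graph, toy_model)
```

In this toy run, `main` receives no name in the first round because nothing is yet known about `helper`; once `helper` has been queried, its predicted parameter name reaches `main` as context in the second round. This mirrors, in a very reduced form, how propagated results can supply context that a single size-limited query lacks.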