Decompilation aims to recover the source-code form of a binary executable. It has many security applications, such as malware analysis, vulnerability detection, and code hardening. A prominent challenge in decompilation is recovering variable names. We propose a novel technique that leverages the strengths of generative models while suppressing potential hallucinations and overcoming the input token limit. We build a prototype, GenNm, on the pre-trained generative model Code-Llama. We fine-tune GenNm on decompiled functions and use program analysis to validate the results produced by the generative model. When querying a function, GenNm includes names from its callers and callees, providing rich contextual information within the model's input token limit. Our results show that GenNm improves the state of the art from 48.1% to 57.9% in the most challenging setup, where the query function is not seen in the training dataset.
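The idea of passing caller and callee names to the model within a fixed token budget can be illustrated with a minimal sketch. This is not GenNm's implementation: the `Function` record, the `build_query` helper, the comment-style prompt layout, and the whitespace-based `count_tokens` are all hypothetical stand-ins chosen for illustration, and a real system would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    """A decompiled function plus any names already predicted for its variables.

    Hypothetical record type used only for this sketch.
    """
    name: str
    body: str
    predicted_names: dict = field(default_factory=dict)  # placeholder -> name


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for the model's tokenizer.
    return len(text.split())


def build_query(target, callers, callees, token_limit):
    """Build a prompt: the query function first, then context lines that fit.

    The query function's body is always kept. Name summaries from callers
    and then callees are appended until the next one would exceed the budget.
    """
    lines = [f"// query function: {target.name}", target.body]
    used = sum(count_tokens(line) for line in lines)
    for role, funcs in (("caller", callers), ("callee", callees)):
        for fn in funcs:
            if not fn.predicted_names:
                continue  # nothing useful to pass along from this function
            names = ", ".join(
                f"{old}={new}" for old, new in sorted(fn.predicted_names.items())
            )
            line = f"// {role} {fn.name}: {names}"
            cost = count_tokens(line)
            if used + cost > token_limit:
                return "\n".join(lines)  # budget exhausted, stop adding context
            lines.append(line)
            used += cost
    return "\n".join(lines)


if __name__ == "__main__":
    target = Function(
        "sub_401000", "int sub_401000(int a1) { return sub_401100(a1); }"
    )
    caller = Function("sub_400f00", "", {"v1": "buf_len"})
    callee = Function("sub_401100", "", {"a1": "size", "v2": "idx"})
    print(build_query(target, [caller], [callee], token_limit=100))
```

In this sketch the context is a compact list of names rather than full function bodies, which is one way to keep the prompt small; how GenNm actually encodes and prioritizes the context is not specified in the abstract.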