Binary code analysis plays a pivotal role in many software security applications, such as software maintenance, malware detection, software vulnerability discovery, and patch analysis. However, unlike source code, binary code is challenging for reverse engineers to understand because it lacks semantic information. Automated tools are therefore needed to assist human analysts in interpreting binary code. In recent years, two groups of technologies have shown promise: (1) deep learning-based techniques have demonstrated competitive results on tasks related to binary code understanding, and (2) Large Language Models (LLMs) have been extensively pre-trained at the source-code level for tasks such as code understanding and generation. This raises the question of how well LLMs can understand binary code. In this work, we propose a benchmark to evaluate the effectiveness of LLMs in real-world reverse engineering scenarios. The benchmark covers two key binary code understanding tasks: function name recovery and binary code summarization. Through extensive evaluations of popular LLMs on our benchmark, we gain valuable insights into their capabilities and limitations. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis. Our results highlight the great potential of LLMs in advancing the field of binary code understanding.