Automatically generating function summaries for binaries is valuable but challenging, because it requires translating the execution behavior and semantics of low-level assembly code into human-readable natural language. However, most existing work on understanding assembly code targets function name generation, and the resulting names rely on numerous abbreviations that leave them hard to interpret. To bridge this gap, we focus on generating complete summaries for binary functions, especially for stripped binaries, which in practice contain no symbol table or debug information. To fully exploit the semantics of assembly code, we present CP-BCS, a binary code summarization framework guided by control flow graphs and pseudo code. CP-BCS uses a bidirectional instruction-level control flow graph, together with pseudo code that incorporates expert knowledge, to learn the execution behavior and logic semantics of a binary function comprehensively. We evaluate CP-BCS at three binary optimization levels (O1, O2, and O3) on three computer architectures (X86, X64, and ARM). The evaluation results demonstrate that CP-BCS is superior and significantly improves the efficiency of reverse engineering.
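To make the idea of a bidirectional instruction-level control flow graph concrete, the following is a minimal sketch, not the CP-BCS implementation. It assumes a hypothetical toy instruction format of (address, mnemonic, jump target), treats every instruction as a graph node, and records each control flow edge in both directions, so that a node can reach its predecessors as well as its successors. The names `TOY_FUNCTION` and `build_bidirectional_cfg` and the mnemonic sets are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy instruction format: (address, mnemonic, jump_target or None).
# Real disassembly has far more cases; this only illustrates the graph shape.
TOY_FUNCTION = [
    (0, "cmp", None),
    (1, "je", 4),      # conditional jump: falls through to 2 or jumps to 4
    (2, "mov", None),
    (3, "jmp", 5),     # unconditional jump: no fall-through edge
    (4, "xor", None),
    (5, "ret", None),  # function exit: no successors
]

# Assumed mnemonic classes for the toy example only.
UNCONDITIONAL = {"jmp"}
TERMINATORS = {"ret"}


def build_bidirectional_cfg(instructions):
    """Return (successors, predecessors) maps keyed by instruction address."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    addresses = [addr for addr, _, _ in instructions]

    for index, (addr, mnemonic, target) in enumerate(instructions):
        edges = []
        if target is not None:
            edges.append(target)  # explicit jump edge
        falls_through = (
            mnemonic not in UNCONDITIONAL and mnemonic not in TERMINATORS
        )
        if falls_through and index + 1 < len(addresses):
            edges.append(addresses[index + 1])  # sequential edge
        for dst in edges:
            successors[addr].add(dst)    # forward direction
            predecessors[dst].add(addr)  # backward direction

    return dict(successors), dict(predecessors)


succ, pred = build_bidirectional_cfg(TOY_FUNCTION)
```

Here `succ` holds the forward edges and `pred` the reversed ones; a graph-based encoder could consume both so that each instruction sees context from either direction. How CP-BCS actually encodes these edges is not specified in this abstract.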