Reverse Engineering Structure and Semantics of Input of a Binary Executable

Knowledge of the input format of binary executables is important for tasks aimed at finding bugs and vulnerabilities, such as generating data for fuzzing or manual reverse engineering. This paper presents an algorithm that uses dynamic taint analysis to recover the structure of the input of a binary executable and the semantic relations between its fields. The algorithm improves upon prior work by not just partitioning the input into runs of consecutive bytes that represent values, but also identifying the syntactic components of structures: atomic fields of fixed and variable length, and different types of arrays, namely arrays of atomic fields, arrays of records, and arrays with variant records. It also infers semantic relations between the fields of a structure, such as count fields that specify the number of elements in an array of records, or offset fields that specify the start location of a variable-length field within the input data. The algorithm constructs a C/C++-like structure to represent the syntactic components and semantic relations. It was implemented in a prototype system named ByteRI 2.0, which was evaluated in a controlled experiment with synthetic subject programs and real-world programs. The synthetic subject programs were created to accept a variety of input formats that mimic the syntactic components and selected semantic relations found in conventional data formats, such as PE, PNG, ZIP, and CSV. The results show that ByteRI 2.0 correctly identifies the syntactic elements and their grammatical structure, as well as the semantic relations between the fields, for both the synthetic subject programs and the real-world programs. When used as generators, the recovered structures produced valid inputs that were accepted by all of the synthetic subject programs and by some of the real-world programs.
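To make the notions of count field and offset field concrete, the following minimal Python sketch builds and parses an input in a small made-up format. The layout, the field names (`count`, `name_offset`, `records`, `name`), and the functions `generate` and `parse` are all assumptions introduced for illustration; they are not the output of ByteRI 2.0 and do not correspond to any format evaluated in the paper.

```python
import struct

# Hypothetical layout (illustrative only, not a structure recovered by ByteRI 2.0):
#   uint16 count          count field: number of elements in `records`
#   uint32 name_offset    offset field: where the variable-length `name` starts
#   record records[count] array of records, each { uint8 kind; uint16 value; }
#   bytes  name           variable-length field, runs to the end of the input
HEADER = struct.Struct("<HI")   # little-endian, no padding: 6 bytes
RECORD = struct.Struct("<BH")   # little-endian, no padding: 3 bytes


def generate(records, name):
    """Serialize records and name, filling in the count and offset fields."""
    count = len(records)
    name_offset = HEADER.size + count * RECORD.size
    body = b"".join(RECORD.pack(kind, value) for kind, value in records)
    return HEADER.pack(count, name_offset) + body + name


def parse(data):
    """Read the input back, using the count and offset fields to locate data."""
    count, name_offset = HEADER.unpack_from(data, 0)
    records = [
        RECORD.unpack_from(data, HEADER.size + i * RECORD.size)
        for i in range(count)
    ]
    name = data[name_offset:]
    return records, name


if __name__ == "__main__":
    blob = generate([(1, 10), (2, 20)], b"abc")
    print(parse(blob))
```

In this sketch the semantic relations are what keep generated data valid: `count` must equal the number of records and `name_offset` must point just past the last record, which is why a generator needs the relations and not only the byte-level partitioning.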
