Compiler optimization level recognition can be applied to vulnerability discovery and binary analysis. Because many different compilation optimization options exist, the resulting differences in binary file contents are very complicated. There are thousands of compiler optimization algorithms and multiple processor architectures, so it is very difficult to analyze binary files manually and recognize their compiler optimization levels with rules. This paper is the first to propose a CNN-based compiler optimization level recognition model: BinEye. The system extracts semantic and structural differences and automatically recognizes compiler optimization levels. The model is designed to suit binary file processing and is easy to understand. We built a dataset containing 80,028 binary files for model training and testing. Our proposed model achieves an accuracy of over 97%. In addition, because BinEye is fully CNN-based, its forward computation is at least 8 times faster than that of a typical RNN-based model. By analyzing the model output, we identified differences in assembly code caused by different compiler optimization levels, which indicates that the proposed model is interpretable. Based on our model, we propose a method for analyzing the code differences caused by different compiler optimization levels, which offers useful guidance for analyzing closed-source compilers and for binary security analysis.
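The abstract gives no architectural details, so the following is only a minimal sketch of the general idea behind a convolutional classifier over an instruction or byte sequence, not BinEye's actual design. The function names (`conv1d_max`, `classify`), the toy token sequence, the hand-picked filters, and the two-class weight matrix are all hypothetical. The sketch illustrates two properties the abstract relies on: each convolution window is scored independently of the others (so the work can run in parallel, unlike an RNN's sequential steps), and global max pooling keeps the position of the strongest activation, which is one plausible way a prediction could be traced back to specific instructions.

```python
from typing import List, Tuple


def conv1d_max(tokens: List[int], kernel: List[float]) -> Tuple[float, int]:
    """Slide one filter over the sequence with ReLU, then global max pooling.

    Returns (max activation, start index of the window that produced it).
    Assumes len(tokens) >= len(kernel).
    """
    k = len(kernel)
    best_val, best_pos = float("-inf"), -1
    # Each window depends only on its own tokens, never on a previous
    # hidden state, so windows could be evaluated in parallel.
    for start in range(len(tokens) - k + 1):
        act = sum(w * x for w, x in zip(kernel, tokens[start:start + k]))
        act = max(act, 0.0)  # ReLU
        if act > best_val:
            best_val, best_pos = act, start
    return best_val, best_pos


def classify(
    tokens: List[int],
    kernels: List[List[float]],
    weights: List[List[float]],
) -> Tuple[int, List[Tuple[float, int]]]:
    """Return (predicted class index, per-filter (activation, position))."""
    feats = [conv1d_max(tokens, kern) for kern in kernels]
    # Linear layer: one row of weights per class, one column per filter.
    scores = [sum(w * f[0] for w, f in zip(row, feats)) for row in weights]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return pred, feats


if __name__ == "__main__":
    # Hypothetical toy input: token IDs standing in for instructions.
    tokens = [1, 0, 0, 2, 3, 0]
    kernels = [[1.0, -1.0], [0.0, 1.0, 1.0]]  # hand-picked, not trained
    weights = [[1.0, 0.0], [0.0, 1.0]]        # two made-up classes
    pred, feats = classify(tokens, kernels, weights)
    print("predicted class:", pred)
    for i, (val, pos) in enumerate(feats):
        print(f"filter {i}: activation {val} at window start {pos}")
```

In a trained model the filters and weights would be learned from labeled binaries; the recorded window positions are what an analyst would inspect to see which instruction patterns drove a given prediction.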