Identifying Non-Control Security-Critical Data through Program Dependence Learning

As control-flow protection becomes widely deployed, it is increasingly difficult for attackers to corrupt control data and hijack control flow. Data-oriented attacks, which instead manipulate non-control data, have been shown to be both feasible and powerful. A fundamental step in such attacks is identifying non-control, security-critical data. However, the identification processes in previous work do not scale, because they rely mainly on tedious manual effort. To address this issue, we propose a novel approach that combines traditional program analysis with deep learning. At a high level, by examining how analysts identify critical data, we first propose dynamic analysis algorithms that identify the program semantics (and features) correlated with the impact of a critical data item. Then, motivated by the unique challenges of the critical data identification task, we formalize the distinguishing features and embed them in customized program dependence graphs (PDGs). Unlike previous work that uses deep learning to learn basic program semantics, this paper adopts a special neural network architecture that can capture the long dependency paths in the PDG through which a critical variable propagates its impact. We have implemented a fully automatic toolchain and conducted comprehensive evaluations. In these evaluations, our model achieves 90% accuracy, and the toolchain uncovers 80 potential critical variables in Google FuzzBench. In addition, we demonstrate the harm that exploits of the identified critical variables can cause by simulating 7 data-oriented attacks through GDB.
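
To make the idea of a variable propagating its impact along PDG dependency paths concrete, the following is a minimal sketch and not the toolchain described above. The toy graph, its node names, and the `impact` helper are all hypothetical. It assumes a PDG can be approximated as a directed graph with data and control edges, and it uses a plain breadth-first search to stand in for the learned model.

```python
from collections import deque

# Hypothetical toy PDG: node -> list of (successor, edge_kind).
# "data" edges model def-use dependence; "control" edges model a branch
# that decides whether a statement executes. All names are illustrative.
PDG = {
    "is_admin": [("if (is_admin)", "data")],
    "if (is_admin)": [("grant_access()", "control"), ("log_event()", "control")],
    "grant_access()": [("open_secret_file()", "data")],
    "log_event()": [],
    "open_secret_file()": [],
    "buf_len": [("memcpy(dst, src, buf_len)", "data")],
    "memcpy(dst, src, buf_len)": [],
}


def impact(pdg, source):
    """Return {node: hop distance} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for succ, _kind in pdg.get(node, []):
            if succ not in dist:  # first visit gives the shortest hop count
                dist[succ] = dist[node] + 1
                queue.append(succ)
    del dist[source]  # report only the nodes the variable influences
    return dist


if __name__ == "__main__":
    for var in ("is_admin", "buf_len"):
        reached = impact(PDG, var)
        print(var, "influences", len(reached), "nodes; farthest hop:",
              max(reached.values()))
```

In this toy graph, `is_admin` reaches a sensitive operation only through a three-hop chain of mixed data and control edges. That is the kind of long dependency path a model with a short receptive field could miss.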
