Decompilers are widely used by security researchers and developers to reverse engineer executable code. While modern decompilers are adept at recovering instructions, control flow, and function boundaries, some useful information from the original source code, such as variable types and names, is lost during the compilation process. Our work aims to predict these variable types and names from the remaining information. We propose STRIDE, a lightweight technique that predicts variable names and types by matching sequences of decompiler tokens to those found in training data. We evaluate it on three benchmark datasets and find that STRIDE achieves comparable performance to state-of-the-art machine learning models for both variable retyping and renaming while being much simpler and faster. We perform a detailed comparison with two recent SOTA transformer-based models in order to understand the specific factors that make our technique effective. We implemented STRIDE in fewer than 1000 lines of Python and have open-sourced it under a permissive license at https://github.com/hgarrereyn/STRIDE.
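To make the idea of matching decompiler token sequences against training data concrete, the following is a minimal sketch under stated assumptions, not the STRIDE implementation itself. It assumes each variable occurrence is described by the tokens immediately to its left and right, that training contexts are stored in a lookup table keyed by context size, and that prediction prefers the largest context with any match. The names `contexts`, `build_table`, and `predict`, the `<PAD>` token, and the context sizes are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

PAD = "<PAD>"  # hypothetical padding token for contexts that run off either end


def contexts(tokens, index, n):
    """Return the n tokens on each side of tokens[index], padded at the edges.

    The token at `index` (the variable itself) is deliberately excluded, so
    the decompiler's placeholder name does not affect the lookup key.
    """
    left = tokens[max(0, index - n):index]
    right = tokens[index + 1:index + 1 + n]
    left = [PAD] * (n - len(left)) + left
    right = right + [PAD] * (n - len(right))
    return tuple(left), tuple(right)


def build_table(examples, sizes=(4, 2, 1)):
    """Count labels per context. `examples` yields (tokens, index, label)."""
    table = defaultdict(Counter)
    for tokens, index, label in examples:
        for n in sizes:
            table[(n, contexts(tokens, index, n))][label] += 1
    return table


def predict(table, tokens, indices, sizes=(4, 2, 1)):
    """Vote over all occurrences of one variable, largest context first."""
    for n in sizes:
        votes = Counter()
        for index in indices:
            votes.update(table.get((n, contexts(tokens, index, n)), {}))
        if votes:
            return votes.most_common(1)[0][0]
    return None  # no context of any size was seen in training


# Toy training data: (decompiler tokens, variable position, original name).
examples = [
    (["for", "(", "v1", "=", "0", ";"], 2, "i"),
    (["v2", "=", "strlen", "(", "s", ")"], 0, "len"),
]
table = build_table(examples)
print(predict(table, ["for", "(", "a1", "=", "0", ";"], [2]))
```

The same table structure could be used for retyping by storing types as labels instead of names. Falling back from large to small contexts is one simple way to trade precision for coverage; the actual context sizes and scoring used by STRIDE may differ.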