While much current research in deep learning-based vulnerability detection operates on disassembled binaries, this paper explores the feasibility of extracting features directly from raw x86-64 machine code. Assembly language is more interpretable for humans, but it requires more complex models to capture token-level context. Machine code, in contrast, may enable more efficient, lightweight models and preserves information that might be lost in disassembly. This paper approaches vulnerability detection through an exploratory study of two specific deep learning architectures and systematically evaluates their performance across three vulnerability types. The results show that graph-based models consistently outperform sequential models, which underscores the importance of control flow relationships, and that machine code contains sufficient information for effective vulnerability discovery.