Language model approaches have recently been applied to binary analysis tasks such as function similarity detection and function signature recovery. These models typically follow a two-stage training process: pre-training via Masked Language Modeling (MLM) on machine code, followed by fine-tuning on specific tasks. While MLM helps the model understand binary code structure, it ignores essential code characteristics, including control and data flow, which negatively affects model generalization. Recent work incorporates domain-specific features (e.g., control flow graphs and dynamic execution traces) into transformer-based approaches to improve semantic understanding of binary code. However, this approach requires complex feature engineering, a cumbersome and time-consuming process that can introduce predictive uncertainty when dealing with stripped or obfuscated code, leading to a performance drop. In this paper, we introduce ProTST, a novel transformer-based methodology for binary code embedding. ProTST employs a hierarchical training process organized as a tree, in which knowledge progressively flows from fundamental tasks at the root to more specialized tasks at the leaves. This progressive teacher-student paradigm allows the model to build on previously learned knowledge, producing high-quality embeddings that can be effectively leveraged for diverse downstream binary analysis tasks. We evaluate ProTST on seven binary analysis tasks; the results show that ProTST improves the average validation score (F1, MRR, and Recall@1) by 14.8% over traditional two-stage training and by 10.7% over multimodal two-stage frameworks.
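The tree-like progressive training described above can be sketched as a pre-order traversal of a task tree, where each child task's model is initialized from the weights its parent learned. This is a minimal illustrative sketch, not the paper's actual implementation: the task names, the toy "weights" (a lineage list standing in for real parameters), and the `train` placeholder are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """A node in the hypothetical task tree: root = fundamental task, leaves = specialized tasks."""
    name: str
    children: list = field(default_factory=list)

def train(task_name, init_weights):
    # Placeholder for actual fine-tuning; here we only record the lineage
    # of tasks whose knowledge the current model inherits.
    return init_weights + [task_name]

def progressive_train(node, parent_weights):
    """Pre-order traversal: each task starts from its parent's learned weights,
    so knowledge flows from the root down to the leaves."""
    weights = train(node.name, parent_weights)
    results = {node.name: weights}
    for child in node.children:
        results.update(progressive_train(child, weights))
    return results

# Toy task tree (names assumed): one fundamental root task, two specialized leaves.
root = TaskNode("masked_lm", [TaskNode("func_similarity"),
                              TaskNode("signature_recovery")])
lineages = progressive_train(root, [])
# Each leaf model builds on the root's knowledge before specializing.
assert lineages["func_similarity"] == ["masked_lm", "func_similarity"]
```

The design point is that no task is trained from scratch: every specialized (student) model inherits the representation learned by its more general (teacher) parent, which is what distinguishes this paradigm from independent two-stage pre-train/fine-tune pipelines.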