Identifying functions of interest in cross-architecture software is useful for malware analysis, securing the software supply chain, and vulnerability research. Cross-architecture binary code similarity search has been explored in numerous studies, which have drawn on a wide range of data sources. These typically include common structures derived from binaries, such as function control flow graphs or binary-level call graphs, the output of the disassembly process, or the output of a dynamic analysis approach. One data source that has received less attention is binary intermediate representations. Binary intermediate representations possess two interesting properties: they are cross-architecture by nature, and they encode the semantics of a function explicitly to support downstream usage. In this paper we propose Function as a String Encoded Representation (FASER), which combines long document transformers with intermediate representations to create a model capable of cross-architecture function search without the need for manual feature engineering, pre-training, or a dynamic analysis step. We compare our approach against a series of baseline approaches on two tasks: a general function search task and a targeted vulnerability search task. Our approach demonstrates strong performance across both tasks, performing better than all baseline approaches.