Analyzing the behavior of cryptographic functions in stripped binaries is a challenging but essential task. Cryptographic algorithms are logically more complex than typical code, yet their analysis is unavoidable in areas such as virus analysis and legacy code inspection. Existing methods often rely on data or structural pattern matching, which generalizes poorly and requires substantial manual work. In this paper, we propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. In FoC, we first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. However, the predictions of FoC-BinLLM are insensitive to minor changes, such as vulnerability patches. To mitigate this limitation, we further build a binary code similarity model (FoC-Sim) upon FoC-BinLLM to create change-sensitive representations, and use it to retrieve similar implementations of unknown cryptographic functions from a database. In addition, we construct a cryptographic binary dataset for evaluation and to facilitate further research in this domain, and we devise an automated method to create semantic labels for binary functions at scale. Evaluation results demonstrate that FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score, and that FoC-Sim outperforms the previous best methods with a 52% higher Recall@1. Furthermore, our method shows practical ability in virus analysis and 1-day vulnerability detection.
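To make the retrieval step and the Recall@1 metric mentioned above concrete, the following is a minimal sketch, not the FoC-Sim implementation. It assumes each function is already represented by an embedding vector, ranks database entries by cosine similarity, and counts a query as a hit when its top-ranked entry carries the correct label. The function names, labels, and three-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top1(query, database):
    """Return the label of the database entry most similar to the query."""
    return max(database, key=lambda label: cosine(query, database[label]))


def recall_at_1(queries, database):
    """Fraction of queries whose top-ranked match carries the correct label."""
    hits = sum(1 for vec, truth in queries if retrieve_top1(vec, database) == truth)
    return hits / len(queries)


# Hypothetical toy embeddings; real ones would come from a similarity model.
database = {
    "aes_encrypt": [0.9, 0.1, 0.0],
    "sha256_update": [0.1, 0.9, 0.1],
    "rc4_keystream": [0.0, 0.2, 0.9],
}
queries = [
    ([0.8, 0.2, 0.1], "aes_encrypt"),
    ([0.2, 0.8, 0.0], "sha256_update"),
    ([0.5, 0.6, 0.1], "aes_encrypt"),  # deliberately ambiguous query
]
score = recall_at_1(queries, database)
```

In this toy setup the third query sits closer to the `sha256_update` entry than to its true label, so two of the three queries are hits. A real evaluation would use a much larger candidate pool, which makes ranking the correct implementation first considerably harder.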