Malicious software attacks are having an increasingly significant economic impact. Commercial malware detection software can be costly, and tools that attribute malware to the specific software vulnerabilities it exploits are largely lacking. Understanding the connection between malware and the vulnerabilities it targets is crucial for analyzing past threats and proactively defending against current ones. In this study, we propose an approach that leverages large language models (LLMs) to detect binary malware, specifically within JAR files, and uses LLM capabilities combined with retrieval-augmented generation (RAG) to identify Common Vulnerabilities and Exposures (CVEs) that malware may exploit. We developed a proof-of-concept tool, MalCVE, that integrates binary code decompilation, deobfuscation, LLM-based code summarization, semantic similarity search, and LLM-based CVE classification. We evaluated MalCVE using a benchmark dataset of 3,839 JAR executables. MalCVE achieved a mean malware-detection accuracy of 97%, at a fraction of the cost of commercial solutions. In particular, the results demonstrate that LLM-based code summarization enables highly accurate and explainable malware identification. MalCVE is also the first tool to associate CVEs with binary malware, achieving a recall@10 of 65%, which is comparable to studies that perform similar analyses on source code.
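For readers unfamiliar with the recall@10 metric reported above, the following is a minimal sketch of one common way to compute recall@k over ranked CVE candidates. It assumes recall is measured per malware sample as the fraction of ground-truth CVEs that appear among the top k candidates, then averaged over samples; the abstract does not state the exact definition used for MalCVE, so this may differ from the authors' evaluation. The function names and the CVE identifiers are hypothetical placeholders, not data from the study.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant CVE IDs found among the top-k ranked candidates."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(samples: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Average recall@k over (ranked candidates, ground-truth set) pairs."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in samples]
    if not scores:
        raise ValueError("at least one sample is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical placeholder identifiers, used only for illustration.
    samples = [
        (["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"],
         {"CVE-0000-0002", "CVE-0000-0009"}),   # one of two relevant CVEs retrieved
        (["CVE-0000-0004", "CVE-0000-0005"],
         {"CVE-0000-0004"}),                    # the single relevant CVE retrieved
    ]
    print(mean_recall_at_k(samples, k=10))
```

Under this definition, a recall@10 of 65% would mean that, on average, 65% of the CVEs associated with a sample appear among the ten highest-ranked candidates.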