The parallel evolution of Large Language Models (LLMs) with advanced code-understanding capabilities and the increasing sophistication of malware presents a new frontier for cybersecurity research. This paper evaluates the efficacy of state-of-the-art LLMs in classifying executable code as either benign or malicious. We introduce an automated pipeline that first decompiles Windows executables into C code using the Ghidra decompiler and then leverages LLMs to perform the classification. Our evaluation reveals that while standard LLMs show promise, they are not yet robust enough to replace traditional anti-virus software. We demonstrate that a fine-tuned model, trained on curated malware and benign datasets, significantly outperforms its vanilla counterpart. However, the performance of even this specialized model degrades notably when it encounters newer malware. This finding underscores the critical need for continuous fine-tuning on emerging threats, so that the model keeps pace with the evolving coding patterns and behaviors of malicious software.
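The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the project name, the export post-script, and both helper functions are hypothetical; only the `analyzeHeadless` entry point and its `-import`/`-postScript` flags come from Ghidra's standard headless-analyzer interface.

```python
# Sketch of the pipeline: (1) decompile a Windows executable to C with
# Ghidra's headless analyzer, (2) wrap the C source in a classification
# prompt for an LLM. Helper names and paths are illustrative assumptions.

def ghidra_headless_cmd(ghidra_dir: str, project_dir: str, binary_path: str,
                        script: str = "ExportDecompiledC.py") -> list[str]:
    """Build the analyzeHeadless invocation that imports the binary and
    runs a post-analysis script (hypothetical here) to dump decompiled C."""
    return [
        f"{ghidra_dir}/support/analyzeHeadless",
        project_dir, "MalwareTriage",   # temp project location and name
        "-import", binary_path,
        "-postScript", script,          # assumed C-export script
        "-deleteProject",               # discard the project afterwards
    ]

def build_prompt(c_source: str) -> str:
    """Wrap the decompiled C in a binary benign/malicious classification
    prompt to be sent to the LLM under evaluation."""
    return (
        "You are a malware analyst. Classify the following decompiled C "
        "code as BENIGN or MALICIOUS. Answer with one word.\n\n"
        f"```c\n{c_source}\n```"
    )
```

In this design the decompilation step is fully offline, so the same C dump can be replayed against both the vanilla and the fine-tuned model for a like-for-like comparison.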