Malware authors often employ code obfuscation to make their malware harder to detect. Existing tools for generating obfuscated code typically require access to the original source code (e.g., C++ or Java), and adding new obfuscations is a non-trivial, labor-intensive process. In this study, we ask the following question: Can Large Language Models (LLMs) generate new obfuscated assembly code? If so, this poses a risk to anti-virus engines and potentially increases the flexibility of attackers to create new obfuscation patterns. We answer this in the affirmative by developing the MetamorphASM benchmark, comprising the MetamorphASM Dataset (MAD) along with three code obfuscation techniques: dead code insertion, register substitution, and control flow change. MetamorphASM systematically evaluates the ability of LLMs to generate and analyze obfuscated code using MAD, which contains 328,200 obfuscated assembly code samples. We release this dataset and analyze the success rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder, CodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly code. The evaluation was performed using established information-theoretic metrics and manual human review to ensure correctness and to provide a foundation for researchers to study and develop remediations to this risk. The source code can be found at the following GitHub link: https://github.com/mohammadi-ali/MetamorphASM.
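To make the three obfuscation techniques concrete, the following is a minimal, illustrative sketch (not the MetamorphASM implementation) of toy string-level versions of dead code insertion, register substitution, and control flow change applied to x86-64 assembly text. All function and label names here are hypothetical.

```python
import random
import re

def insert_dead_code(lines):
    """Dead code insertion: scatter semantically inert instructions
    (no-ops that leave program state unchanged) between real ones."""
    dead = ["nop", "xchg rax, rax"]  # both are effective no-ops
    out = []
    for line in lines:
        out.append(line)
        if random.random() < 0.5:
            out.append(random.choice(dead))
    return out

def substitute_registers(lines, mapping=None):
    """Register substitution: consistently rename general-purpose
    registers in a single pass (a cyclic permutation by default)."""
    mapping = mapping or {"rax": "rbx", "rbx": "rcx", "rcx": "rax"}
    pat = re.compile(r"\b(" + "|".join(mapping) + r")\b")
    return [pat.sub(lambda m: mapping[m.group(1)], line) for line in lines]

def change_control_flow(lines):
    """Control flow change: split the block in two and reconnect the
    halves with unconditional jumps, so textual order no longer
    matches execution order while semantics are preserved."""
    mid = len(lines) // 2
    first, second = lines[:mid], lines[mid:]
    return (["jmp L_first", "L_second:"] + second + ["jmp L_end",
            "L_first:"] + first + ["jmp L_second", "L_end:"])

src = ["mov rax, 1", "add rax, 2", "mov rbx, rax", "ret"]
print(insert_dead_code(src))
print(substitute_registers(src))
print(change_control_flow(src))
```

Real metamorphic engines operate on parsed instructions rather than raw strings, but these transforms capture the essential idea: the obfuscated listing differs syntactically from the original while computing the same result.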