Code decompilation analysis is a fundamental yet challenging task in malware reverse engineering, particularly due to the pervasive use of sophisticated obfuscation techniques. Although recent large language models (LLMs) have shown promise in translating low-level representations into high-level source code, most existing approaches rely on generic code pretraining and lack adaptation to malicious software. We propose LLM4CodeRE, a domain-adaptive LLM framework for bidirectional code reverse engineering that supports both assembly-to-source decompilation and source-to-assembly translation within a unified model. To enable effective task adaptation, we introduce two complementary fine-tuning strategies: (i) a Multi-Adapter approach for task-specific syntactic and semantic alignment, and (ii) a Seq2Seq Unified approach using task-conditioned prefixes to enforce end-to-end generation constraints. Experimental results demonstrate that LLM4CodeRE outperforms existing decompilation tools and general-purpose code models, achieving robust bidirectional generalization.
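To make the task-conditioned prefix idea in the Seq2Seq Unified strategy concrete, the following is a minimal sketch of how a single model could be fed both translation directions. The prefix strings, the task names `asm2src` and `src2asm`, and the helpers `make_example` and `make_bidirectional_pair` are illustrative assumptions; the abstract does not specify the actual prefixes or data format used by LLM4CodeRE.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical prefix strings: the abstract does not give the real ones.
TASK_PREFIXES = {
    "asm2src": "translate assembly to source: ",
    "src2asm": "translate source to assembly: ",
}


@dataclass(frozen=True)
class Seq2SeqExample:
    """One training pair: prefixed input text and its expected output."""
    model_input: str
    target: str


def make_example(task: str, source: str, target: str) -> Seq2SeqExample:
    """Prepend the task prefix so one model can tell the directions apart."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return Seq2SeqExample(TASK_PREFIXES[task] + source, target)


def make_bidirectional_pair(asm: str, src: str) -> List[Seq2SeqExample]:
    """Turn one (assembly, source) pair into two examples, one per direction."""
    return [
        make_example("asm2src", asm, src),
        make_example("src2asm", src, asm),
    ]


if __name__ == "__main__":
    # Toy, benign snippet used only to show the data layout.
    asm = "mov eax, 1\nret"
    src = "int f(void) { return 1; }"
    for example in make_bidirectional_pair(asm, src):
        print(repr(example.model_input), "->", repr(example.target))
```

Under these assumptions, each aligned assembly/source pair yields two training examples, so the same weights see both directions and the prefix alone selects the task at inference time. A Multi-Adapter variant would instead route each task to its own adapter weights rather than changing the input text.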