Obfuscation poses a persistent challenge for software engineering tasks such as program comprehension, maintenance, testing, and vulnerability detection. While compiler optimizations and third-party code often introduce transformations that obscure program intent, existing analysis tools and large language models (LLMs) struggle to recover the original semantics. In this work, we investigate whether LLMs, when fine-tuned with symbolic execution artifacts, can effectively deobfuscate programs and restore analyzability. We construct a benchmark by applying four widely studied transformations (control-flow flattening, opaque predicates, arithmetic encoding, and branch encoding) across diverse C programs from the TUM Obfuscation Benchmarks, the LLVM test suite, and algorithmic repositories. We then compare three state-of-the-art LLMs under two training configurations: baseline fine-tuning on obfuscated/original code pairs, and enhanced fine-tuning with additional KLEE artifacts such as SMT constraints, path statistics, and test cases. Our evaluation examines syntactic correctness (compilation success), semantic fidelity (behavioral equivalence under symbolic execution), and code quality (readability and structure). Results show that GPT-4.1-mini achieves the strongest deobfuscation overall, and that incorporating KLEE artifacts consistently improves semantic preservation and compilation success across models. These findings position deobfuscation as a broad software engineering concern and demonstrate that combining LLMs with symbolic execution can strengthen automated testing, static analysis, and program comprehension in the presence of obfuscation.