WebAssembly (abbreviated Wasm) has emerged as a cornerstone of web development, offering a compact binary format that allows high-performance applications to run at near-native speeds in web browsers. Despite these advantages, Wasm's binary nature makes it hard for developers and researchers to read, particularly when debugging or analyzing web applications, so effective decompilation is crucial. Traditional decompilers, however, often struggle to produce readable output, and although some large language model (LLM)-based decompilers are compatible with general binary files, they still face specific challenges when dealing with Wasm. In this paper, we introduce WaDec, the first approach to use a fine-tuned LLM to interpret and decompile Wasm binary code into a higher-level, more comprehensible source code representation. The LLM is fine-tuned on a specialized dataset of wat-C code snippets (wat being the WebAssembly text format) using self-supervised learning techniques. This enables WaDec to decompile not only complete wat functions but also finer-grained wat code snippets. Our experiments demonstrate that WaDec markedly outperforms current state-of-the-art tools across several metrics. It achieves a code inflation rate of only 3.34%, a 97% reduction compared to the state-of-the-art's 116.94%. Unlike the baselines' output, which cannot be directly compiled or executed, WaDec's output maintains a recompilability rate of 52.11%, a re-execution rate of 43.55%, and an output consistency of 27.15%. Additionally, it exceeds state-of-the-art performance in AST edit distance by 185%, cyclomatic complexity by 8%, and cosine similarity by 41%, achieving an average code similarity above 50%.
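As a quick sanity check on the reported figures, the 97% reduction in code inflation rate can be reproduced from the two percentages quoted above. The sketch below is only illustrative arithmetic: the helper name `relative_reduction` is hypothetical and not part of WaDec, and it assumes "reduction" means the relative decrease with respect to the state-of-the-art baseline.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the fractional decrease of `new` relative to `baseline`.

    Hypothetical helper for illustration; assumes a non-zero baseline.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - new) / baseline


# Code inflation rates (in percent) as quoted in the abstract.
baseline_inflation = 116.94  # state-of-the-art tool
wadec_inflation = 3.34       # WaDec

reduction = relative_reduction(baseline_inflation, wadec_inflation)
print(f"{reduction:.1%}")  # prints "97.1%"
```

Under that assumption, the computed value rounds to the 97% stated in the abstract.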