Decompilation is a well-studied area with numerous high-quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and instruction set architectures (ISAs). Recent interest in neural approaches has produced portable tools that generate readable code. However, to date such techniques are usually restricted to synthetic programs without optimization, and no prior work has evaluated their portability. Furthermore, while the generated code may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence transformer trained over real-world code. We develop a novel tokenizer and exploit no-dropout training to produce high-quality code. We utilize type inference to generate programs that are more readable and accurate than those of standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types, and unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 functions from AnghaBench on two ISAs and at two optimization levels. SLaDe is up to 6 times more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler, and up to 4 times more accurate than the large language model ChatGPT, and it generates significantly more readable code than both.