Decompilation is a well-studied area with numerous high-quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. However, to date such techniques are usually restricted to synthetic programs without optimization, and their portability has not been evaluated. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence transformer trained over real-world code. We develop a novel tokenizer and exploit no-dropout training to produce high-quality code. We use type inference to generate programs that are more readable and accurate than those produced by standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types and, unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 functions from AnghaBench on two ISAs and at two optimization levels. SLaDe is up to 6 times more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler, up to 4 times more accurate than the large language model ChatGPT, and generates significantly more readable code than both.
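To make the seq2seq framing concrete, here is a minimal sketch of what inference with such a decompiler could look like. This is not SLaDe's actual code: the checkpoint path is a placeholder assumption, and SLaDe's custom tokenizer and training setup differ from the stock HuggingFace components shown here. The sketch only illustrates the task shape: disassembled machine code in, candidate C source out.

```python
# Hypothetical seq2seq decompilation inference sketch (not SLaDe's implementation).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "path/to/decompiler-checkpoint" is a placeholder, not a real released model.
tokenizer = AutoTokenizer.from_pretrained("path/to/decompiler-checkpoint")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/decompiler-checkpoint")

# Input: a disassembled function (e.g., x86 at -O0).
asm = """
push   rbp
mov    rbp, rsp
mov    eax, edi
add    eax, esi
pop    rbp
ret
"""

inputs = tokenizer(asm, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=5)

# Output: a candidate C translation, e.g. "int add(int a, int b) { return a + b; }"
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```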