CODEFUSE-DEBENCH: An Empirical Study on Readability, Recompilability, and Functionality

Binary decompilation aims to recover high-level source code from binaries, but existing evaluations rely mainly on syntactic similarity or single-axis readability metrics, which fail to capture practical reusability. We propose a reusability-driven evaluation paradigm that measures decompiler quality along three orthogonal dimensions: readability, recompilability, and functionality. We present DEBENCH, the first automated framework for multidimensional decompilation evaluation. DEBENCH contains 240 atomic test functions, organized into 8 source files and compiled into 640 binaries. It combines LLM-as-judge readability scoring under URAF (18 sub-dimensions), iterative compile-and-repair under a fixed 50-iteration budget, and Frida-based differential dynamic tracing at the program, function, and instruction levels. We evaluate five mainstream decompilers and three repair LLMs. Our study yields four findings. First, the reusability cliff is steep: the best decompiler-LLM pair reaches 22.3% Exact+Partial program-level behavioral overlap but only 1.2% exact stdout match, nearly 50 percentage points below recompilability. Second, settings that maximize readability do not maximize functionality: -O3 yields the lowest readability but the highest functionality, and Clang yields lower readability than GCC but 2.6x the functionality. Third, cross-decompiler variation at the functional level is 20x, far larger than the 1.6x cross-LLM variation, showing that progress depends more on decompiler engines than on larger repair models. Fourth, failures fall into three categories: syntactic noise, type-system collapse (about 19% of repair errors), and irreversible upstream losses such as ARM64 relocation idioms and C++ ABI features.
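The abstract does not say how the compile-and-repair stage is implemented, so the following is only a minimal sketch of what a budgeted repair loop could look like. All names here (`compile_and_repair`, `try_compile`, `repair`, `RepairResult`) are hypothetical and not taken from DEBENCH, and the sketch assumes that the 50-iteration budget counts repair calls. The compiler and the repair model are passed in as plain callables, so the loop can be exercised with stubs and without a real toolchain or LLM.

```python
from typing import Callable, NamedTuple


class RepairResult(NamedTuple):
    source: str       # final version of the decompiled source
    compiled: bool    # True if the final version compiled cleanly
    iterations: int   # number of repair calls actually made


def compile_and_repair(
    source: str,
    try_compile: Callable[[str], str],   # hypothetical: returns diagnostics, "" on success
    repair: Callable[[str, str], str],   # hypothetical: (source, diagnostics) -> new source
    budget: int = 50,                    # assumption: budget counts repair calls
) -> RepairResult:
    """Alternate compile attempts and repairs until success or budget exhaustion."""
    repairs = 0
    while True:
        diagnostics = try_compile(source)
        if not diagnostics:
            return RepairResult(source, True, repairs)
        if repairs >= budget:
            return RepairResult(source, False, repairs)
        source = repair(source, diagnostics)
        repairs += 1


# Toy stand-ins for a compiler and a repair model (illustration only).
def toy_compile(src: str) -> str:
    return "" if src.endswith(";") else "error: expected ';'"


def toy_repair(src: str, diagnostics: str) -> str:
    return src + ";"


fixed = compile_and_repair("int x = 1", toy_compile, toy_repair)
stuck = compile_and_repair("int x = 1", toy_compile, lambda s, d: s, budget=3)
```

In this sketch `fixed` succeeds after a single repair, while `stuck` uses a repair stub that never changes the source and therefore exhausts its budget without compiling. In a real pipeline `try_compile` would wrap a compiler invocation and `repair` would wrap a call to the repair LLM.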
