Software debloating tools seek to improve program security and performance by removing unnecessary code, called bloat. While many techniques have been proposed, several barriers to their adoption have emerged. In particular, debloating tools are highly specialized, making it difficult for adopters to find the right type of tool for their needs. Adoption is further hindered by a lack of established metrics and comparative evaluations between tools. To close this information gap, we surveyed 10 years of debloating literature and several tools currently under commercial development to taxonomize knowledge about the debloating ecosystem. We then conducted a broad comparative evaluation of 10 debloating tools to determine their relative strengths and weaknesses. Our evaluation, conducted on a diverse set of 20 benchmark programs, measures tools across 12 performance, security, and correctness metrics. The results surface several concerning findings that contradict the prevailing narrative in the debloating literature. First, debloating tools lack the maturity required to be used on real-world software, as evidenced by a slim 22% overall success rate for creating passable debloated versions of medium- and high-complexity benchmarks. Second, debloating tools struggle to produce sound and robust programs. Using our novel differential fuzzing tool, DIFFER, we discovered that only 13% of our debloating attempts produced a sound and robust debloated program. Finally, our results indicate that debloating tools typically do not significantly improve the performance or security posture of debloated programs according to our evaluation metrics. We believe that our contributions in this paper will help potential adopters better understand the landscape of tools and will motivate future research and development of more capable debloating tools.