Systematic Assessment of Fuzzers using Mutation Analysis

Fuzzing is an important method to discover vulnerabilities in programs. Despite considerable progress in this area in recent years, measuring and comparing the effectiveness of fuzzers is still an open research question. In software testing, the gold standard for evaluating test quality is mutation analysis, which evaluates a test's ability to detect synthetic bugs: if a set of tests fails to detect such mutations, it is expected to also fail to detect real bugs. Mutation analysis subsumes various coverage measures and provides a large and diverse set of faults that can be arbitrarily hard to trigger and detect, thus preventing the problems of saturation and overfitting. Unfortunately, the cost of traditional mutation analysis is exorbitant for fuzzing, as each mutation must be evaluated independently. In this paper, we apply modern mutation analysis techniques that pool multiple mutations and allow us, for the first time, to evaluate and compare fuzzers with mutation analysis. We introduce an evaluation bench for fuzzers and apply it to a number of popular fuzzers and subjects. In a comprehensive evaluation, we show how it can be used to assess fuzzer performance and measure the impact of improved techniques. The required CPU time remains manageable: 4.09 CPU years are needed to analyze a fuzzer on seven subjects and a total of 141,278 mutations. We find that today's fuzzers can detect only a small percentage of mutations, which should be seen as a challenge for future research, notably in (1) detecting failures beyond generic crashes and (2) triggering mutations (and thus faults).
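To make the idea of a mutation score concrete, the following is a minimal sketch and not the evaluation bench described above. The function `clamp`, the three hand-written mutants, the two oracles, and the input sets are all hypothetical names chosen for illustration. The sketch shows the two challenges named in the abstract: an input set must trigger a mutation, and the oracle must detect the resulting misbehavior. A crash-only oracle, which is roughly what a fuzzer relies on by default, misses mutants that silently return wrong values.

```python
from typing import Callable, Sequence, Tuple

Args = Tuple[int, int, int]
Func = Callable[[int, int, int], int]


def clamp(x: int, lo: int, hi: int) -> int:
    """Original program (hypothetical subject): clamp x into [lo, hi]."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def mutant_wrong_value(x: int, lo: int, hi: int) -> int:
    # Mutation: `return hi` replaced by `return lo` (silent wrong result).
    if x < lo:
        return lo
    if x > hi:
        return lo
    return x


def mutant_off_by_one(x: int, lo: int, hi: int) -> int:
    # Mutation: `x > hi` replaced by `x > hi + 1` (only visible at x == hi + 1).
    if x < lo:
        return lo
    if x > hi + 1:
        return hi
    return x


def mutant_crash(x: int, lo: int, hi: int) -> int:
    # Mutation: `return lo` replaced by a division by zero (crashes when reached).
    if x < lo:
        return lo // (hi - hi)
    if x > hi:
        return hi
    return x


MUTANTS: Sequence[Func] = (mutant_wrong_value, mutant_off_by_one, mutant_crash)


def crash_oracle(original: Func, mutant: Func, args: Args) -> bool:
    """Detects a mutant only if it raises an exception on this input."""
    try:
        mutant(*args)
    except Exception:
        return True
    return False


def differential_oracle(original: Func, mutant: Func, args: Args) -> bool:
    """Detects a mutant if it crashes or its output differs from the original."""
    try:
        return mutant(*args) != original(*args)
    except Exception:
        return True


def mutation_score(mutants: Sequence[Func], inputs: Sequence[Args], oracle) -> float:
    """Fraction of mutants detected ("killed") by at least one input."""
    killed = sum(any(oracle(clamp, m, a) for a in inputs) for m in mutants)
    return killed / len(mutants)


# Inputs that never exceed the upper bound, so two mutations are never triggered.
WEAK_INPUTS: Sequence[Args] = ((5, 0, 10), (-3, 0, 10))
# Adding inputs above the upper bound triggers the remaining mutations.
STRONG_INPUTS: Sequence[Args] = WEAK_INPUTS + ((11, 0, 10), (20, 0, 10))

if __name__ == "__main__":
    for name, inputs in (("weak", WEAK_INPUTS), ("strong", STRONG_INPUTS)):
        for oracle in (crash_oracle, differential_oracle):
            score = mutation_score(MUTANTS, inputs, oracle)
            print(f"{name:6s} {oracle.__name__:20s} {score:.2f}")
```

In this toy setting, the crash-only oracle detects one of the three mutants regardless of the input set, while the differential oracle detects all three once the inputs reach the mutated branches. Note that each mutant is run separately here, which is the independent evaluation whose cost the abstract calls exorbitant; the pooling of multiple mutations is not modeled in this sketch.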
