RandSet: Randomized Corpus Reduction for Fuzzing Seed Scheduling

Seed explosion is a fundamental problem in fuzzing seed scheduling: the fuzzer maintains a huge corpus and fails to choose promising seeds from it. Existing work focuses on seed prioritization but still suffers from seed explosion because the corpus remains huge. We tackle the problem from a new perspective: corpus reduction, i.e., computing a subset of the seed corpus. However, corpus reduction can lead to poor seed diversity and large runtime overhead. Prior techniques such as cull_queue, AFL-Cmin, and MinSet suffer from poor diversity or prohibitive overhead, making them unsuitable for high-frequency seed scheduling. We propose RandSet, a novel randomized corpus reduction technique that simultaneously reduces corpus size and yields diverse seed selection with minimal overhead. Our key insight is to introduce randomness into corpus reduction and thereby gain two benefits of randomized algorithms: randomized output (diverse seed selection) and low runtime cost. Specifically, we formulate corpus reduction as a set cover problem and compute a randomized subset that covers all features of the entire corpus. We then schedule seeds from this small, randomized subset rather than from the entire corpus, effectively mitigating seed explosion. We implement RandSet on three popular fuzzers (AFL++, LibAFL, and Centipede) and evaluate it on standalone programs, FuzzBench, and Magma. Results show that RandSet achieves significantly more diverse seed selection than other reduction techniques, with average subset ratios of 4.03% and 5.99% on standalone and FuzzBench programs, respectively. In AFL++, RandSet achieves a 16.58% coverage gain on standalone programs and up to 3.57% on FuzzBench. On Magma, it triggers up to 7 more ground-truth bugs than the state of the art. It introduces only 1.17%-3.93% overhead.
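The set cover formulation can be illustrated with a minimal sketch. The abstract does not spell out the algorithm, so the code below is an assumption rather than RandSet's actual implementation: it visits seeds in a random order and keeps a seed only if it covers a feature not yet covered. The names `randomized_cover` and `pick_seed`, the representation of a corpus as a mapping from seed name to a set of integer feature IDs, and the toy corpus are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Set


def randomized_cover(
    corpus: Dict[str, Set[int]],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return a random subset of seeds that covers every feature in `corpus`.

    Assumed strategy (not taken from the paper): shuffle the seeds, then
    keep each seed that contributes at least one still-uncovered feature.
    """
    rng = rng or random.Random()
    order = list(corpus)
    rng.shuffle(order)  # the randomness that makes the output vary per call

    covered: Set[int] = set()
    subset: List[str] = []
    for seed in order:
        new_features = corpus[seed] - covered
        if new_features:  # seed adds coverage, so keep it
            subset.append(seed)
            covered |= new_features
    return subset


def pick_seed(subset: List[str], rng: Optional[random.Random] = None) -> str:
    """Schedule one seed from the reduced subset (uniform choice, assumed)."""
    rng = rng or random.Random()
    return rng.choice(subset)


# Hypothetical toy corpus: seed name -> set of covered feature IDs.
corpus = {
    "s0": {1, 2},
    "s1": {2, 3},
    "s2": {1, 2, 3},
    "s3": {4},
    "s4": {3, 4},
}

subset = randomized_cover(corpus)
next_seed = pick_seed(subset)
```

Each call may return a different subset, but every subset covers the same features as the full corpus, because a seed is skipped only when all of its features are already covered. A single pass over a shuffled list is cheap, which is the kind of trade-off the abstract attributes to randomization; the real technique may differ in both selection rule and data structures.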
