Evaluating Search-Based Software Microbenchmark Prioritization

Ensuring that software performance does not degrade after a code change is paramount. One solution is to regularly execute software microbenchmarks, a performance testing technique similar to (functional) unit tests; however, this often becomes infeasible due to extensive runtimes. To address this challenge, research has investigated regression testing techniques such as test case prioritization (TCP), which reorders the execution within a microbenchmark suite to detect larger performance changes sooner. Such techniques are either designed for unit tests and perform sub-par on microbenchmarks, or require complex performance models, which drastically reduces their potential application. In this paper, we empirically evaluate single- and multi-objective search-based microbenchmark prioritization techniques to understand whether they are more effective and efficient than greedy, coverage-based techniques. To this end, we devise three search objectives: coverage (to maximize), coverage overlap (to minimize), and historical performance change detection (to maximize). We find that search algorithms (SAs) are competitive with, but do not outperform, the best greedy, coverage-based baselines. However, a simple greedy technique that uses solely the performance change history (without coverage information) is equally or more effective than the best coverage-based techniques while being considerably more efficient, with a runtime overhead of less than 1%. These results show that simple, non-coverage-based techniques are a better fit for microbenchmarks than complex coverage-based techniques.
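
To make the idea of a greedy, history-only prioritization concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that each benchmark has a list of relative performance changes observed in earlier versions and that benchmarks are ranked by the mean absolute size of those changes. The function name `prioritize_by_history`, the benchmark names, the numbers, and the choice of the mean as aggregation are all hypothetical and serve illustration only.

```python
from statistics import mean


def prioritize_by_history(history):
    """Return benchmark names ordered by mean absolute historical change.

    `history` maps a benchmark name to a list of relative performance
    changes seen in earlier versions (assumed input format).
    """
    def score(name):
        changes = history[name]
        # Benchmarks without any recorded history get the lowest score.
        return mean(abs(c) for c in changes) if changes else 0.0

    # Largest score first; ties are broken by name for a deterministic order.
    return sorted(history, key=lambda name: (-score(name), name))


# Hypothetical example data: relative changes per benchmark.
history = {
    "bench_parse": [0.02, 0.01, 0.03],
    "bench_sort": [0.40, 0.10],
    "bench_io": [],
    "bench_hash": [-0.25, 0.05],
}
order = prioritize_by_history(history)
print(order)
```

Because such a ranking needs only a lookup and a sort over past measurements, it requires no coverage extraction, which is consistent with the low runtime overhead reported for the history-based technique.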
