In the database community, we typically evaluate new methods based on experimental results, which we produce by integrating the proposed method along with a set of baselines in a single benchmarking codebase and measuring the individual runtimes. If we are unhappy with the performance of our method, we gradually improve it while repeatedly comparing it to the baselines, until we outperform them. While this seems like a reasonable approach, it makes one delicate assumption: that across the optimization workflow, there exists only a single compiled version of each baseline to compare to. However, we learned the hard way that in practice, even when the source code of a baseline remains untouched, general-purpose compilers might still generate substantially different compiled code for it across builds, caused by seemingly unrelated changes in other parts of the codebase. This leads to flawed comparisons and evaluations. To tackle this problem, we propose the concept of Multi-Version Experimental Evaluation (MVEE). MVEE automatically and transparently analyzes subsequent builds at the assembly-code level for "build anomalies" and materializes them as new versions of the affected methods. As a consequence, all observed versions of the respective methods can be included in the experimental evaluation, considerably increasing its quality and overall expressiveness.
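To make the idea of detecting build anomalies concrete, the following is a minimal sketch and not the MVEE implementation itself. It assumes that the disassembly of each method is already available as text for every build (for example, extracted with a disassembler), and it treats any change in the normalized assembly as a new version. The names `normalize_asm`, `fingerprint`, and `VersionRegistry` are hypothetical, and masking every hexadecimal literal is a deliberate simplification that would also hide changed immediates.

```python
import hashlib
import re


def normalize_asm(asm: str) -> str:
    """Strip comments, blank lines, and hex literals (assumed to be addresses)
    so that pure relocation of code does not count as a change."""
    lines = []
    for raw in asm.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        line = re.sub(r"0x[0-9a-fA-F]+", "ADDR", line)
        lines.append(re.sub(r"\s+", " ", line))
    return "\n".join(lines)


def fingerprint(asm: str) -> str:
    """Hash the normalized assembly of one method."""
    return hashlib.sha256(normalize_asm(asm).encode("utf-8")).hexdigest()


class VersionRegistry:
    """Tracks every distinct compiled version observed per method (hypothetical helper)."""

    def __init__(self):
        # method name -> fingerprints in the order they were first observed
        self._versions = {}

    def record(self, method: str, asm: str) -> int:
        """Register the assembly from one build and return its 1-based version number."""
        fp = fingerprint(asm)
        seen = self._versions.setdefault(method, [])
        if fp not in seen:
            seen.append(fp)  # a build anomaly: same source, new compiled code
        return seen.index(fp) + 1

    def version_count(self, method: str) -> int:
        return len(self._versions.get(method, []))


registry = VersionRegistry()
build_1 = "mov eax, 1   # load constant\ncall 0x401000\nret"
build_2 = "mov eax, 1\ncall 0x402000\nret"   # only the call target moved
build_3 = "lea eax, [rdi+1]\nret"            # different instructions
for build in (build_1, build_2, build_3):
    print("baseline version:", registry.record("baseline", build))
```

In such a setup, each benchmark run would be tagged with the version number returned by `record`, so that runtimes of different compiled versions of the same baseline are reported separately rather than silently mixed.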