I would like to share recommendations on how to run performance benchmarks for the purpose of evaluating computer science research. Research in my field (programming language research) often involves performance considerations, but performance measurement is typically not the main tool used to evaluate our work (we usually evaluate via formal statements and their proofs, experience writing large or interesting examples, or systematic comparison of expressivity, feature set, etc.). My impression is that, as a result, we tend not to do our performance evaluation very well. In the present document I will try to explain a methodology for benchmarking correctly (I hope!). People with no prior benchmarking experience should be able to build a solid performance evaluation as part of their research. I explain the justification for each aspect along the way.