In empirical software engineering (SE) research, researchers have considerable freedom to decide how to process data, which operationalizations to use, and which statistical model to fit. Gelman and Loken describe this freedom as leading to a "garden of forking paths". Although it is often seen as an advantage, it also poses a threat to robustness and replicability: variations in analytical decisions, even when justifiable, can lead to divergent conclusions. To better understand this risk, we conducted a multiverse analysis on a published empirical SE paper. We chose a Mining Software Repositories (MSR) study, because MSR studies commonly use non-trivial statistical models to analyze post-hoc, observational data. In that study's analysis, we identified nine pivotal analytical decisions, each with at least one equally defensible alternative, and systematically reran all 3,072 resulting analysis pipelines (universes) on the original dataset. Only 6 of these universes (<0.2%) reproduced the published results; the overwhelming majority produced qualitatively different, and sometimes even opposite, findings. This case study of a data-analysis method commonly applied to empirical software engineering data shows that methodological choices can influence outcomes more profoundly than is often acknowledged. We therefore advocate that SE researchers complement standard reporting with robustness checks across plausible analysis variants or, at least, explicitly justify each analytical decision. We propose a structured classification model to help categorize methodological choices and improve their justification. We also show that multiverse analysis is a practical tool in the methodological arsenal of SE researchers, one that can help produce more reliable, reproducible science.
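As a rough illustration of what rerunning every analysis pipeline involves, the sketch below enumerates a small multiverse as the Cartesian product of decision options and counts how many universes agree with a baseline on the direction of an effect. The decision names, their options, the toy data, and the helper functions are all hypothetical; they are not the nine decisions or the dataset from the study, and the comparison of group summaries stands in for whatever statistical model a real analysis would fit.

```python
import math
import statistics
from itertools import product

# Hypothetical decision grid (illustrative only, not the study's decisions).
DECISIONS = {
    "outlier_rule": ["keep_all", "drop_max"],
    "transform": ["raw", "log1p", "sqrt"],
    "estimator": ["mean", "median"],
}


def enumerate_universes(decisions):
    """Return one dict per universe: every combination of decision options."""
    names = list(decisions)
    return [dict(zip(names, combo)) for combo in product(*decisions.values())]


def summarize(values, universe):
    """Apply one universe's choices to a group and return a single summary."""
    values = sorted(values)
    if universe["outlier_rule"] == "drop_max":
        values = values[:-1]  # discard the largest observation
    if universe["transform"] == "log1p":
        values = [math.log1p(v) for v in values]
    elif universe["transform"] == "sqrt":
        values = [math.sqrt(v) for v in values]
    if universe["estimator"] == "mean":
        return statistics.mean(values)
    return statistics.median(values)


def effect_direction(group_a, group_b, universe):
    """Sign of (summary of A - summary of B): 1, 0, or -1."""
    diff = summarize(group_a, universe) - summarize(group_b, universe)
    return (diff > 0) - (diff < 0)


# Toy data (made up): group A contains one extreme value.
group_a = [1.0, 2.0, 2.0, 3.0, 40.0]
group_b = [3.0, 4.0, 4.0, 5.0, 6.0]

universes = enumerate_universes(DECISIONS)
directions = [effect_direction(group_a, group_b, u) for u in universes]

# Treat the first universe as the "published" analysis and count agreement.
baseline = directions[0]
agreeing = sum(d == baseline for d in directions)
print(f"{agreeing} of {len(universes)} universes agree with the baseline")
```

The number of universes is the product of the option counts per decision (here 2 × 3 × 2 = 12), which is why even a handful of decisions quickly yields thousands of pipelines. Reporting the share of universes that agree with the published result, as the abstract does with 6 of 3,072, is one simple way to summarize robustness.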