Exploring the Garden of Forking Paths in Empirical Software Engineering Research: A Multiverse Analysis

In empirical software engineering (SE) research, researchers have considerable freedom to decide how to process data, what operationalizations to use, and which statistical model to fit. Gelman and Loken refer to this freedom as leading to a "garden of forking paths". Although this freedom is often seen as an advantage, it also poses a threat to robustness and replicability: variations in analytical decisions, even when justifiable, can lead to divergent conclusions. To better understand this risk, we conducted a so-called multiverse analysis on a published empirical SE paper. The paper we picked is a Mining Software Repositories (MSR) study, as MSR studies commonly use non-trivial statistical models to analyze post-hoc, observational data. In that study, we identified nine pivotal analytical decisions, each with at least one equally defensible alternative, and systematically reran all 3,072 resulting analysis pipelines on the original dataset. Only 6 of these universes (<0.2%) reproduced the published results; the overwhelming majority produced qualitatively different, and sometimes even opposite, findings. This case study of a data analysis method commonly applied to empirical software engineering data reveals how methodological choices can exert a more profound influence on outcomes than is often acknowledged. We therefore advocate that SE researchers complement standard reporting with robustness checks across plausible analysis variants or, at least, explicitly justify each analytical decision. To support this, we propose a structured classification model that helps classify and improve the justification of methodological choices. In addition, we show that multiverse analysis is a practical tool in the methodological arsenal of SE researchers, one that can help produce more reliable, reproducible science.
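
To make the idea of a multiverse concrete, the following is a minimal sketch of how analysis universes can be enumerated as the Cartesian product of the options for each decision, with every resulting pipeline then rerun and compared against a baseline. Everything in it is an assumption made for illustration: the toy data, the three decisions (`outliers`, `summary`, `transform`), and the helper names `run_universe` and `enumerate_universes` are hypothetical and are not taken from the study, whose nine decisions and their options are not listed in this abstract.

```python
from itertools import product
from statistics import mean, median

# Toy data, made up for illustration only (not from the study).
GROUP_A = [1.0, 2.0, 2.5, 3.0, 40.0]
GROUP_B = [2.0, 3.0, 3.5, 4.0, 4.5]

# Hypothetical analytical decisions, each with its defensible options.
DECISIONS = {
    "outliers": ["keep", "drop_max"],
    "summary": ["mean", "median"],
    "transform": ["raw", "sqrt"],
}


def enumerate_universes(decisions):
    """Return one dict per universe: the Cartesian product of all options."""
    names = list(decisions)
    return [dict(zip(names, combo)) for combo in product(*decisions.values())]


def run_universe(choice):
    """Run one analysis pipeline and report which group scores higher."""

    def prepare(values):
        vals = sorted(values)
        if choice["outliers"] == "drop_max":
            vals = vals[:-1]  # drop the single largest observation
        if choice["transform"] == "sqrt":
            vals = [v ** 0.5 for v in vals]
        return vals

    summarize = mean if choice["summary"] == "mean" else median
    a = summarize(prepare(GROUP_A))
    b = summarize(prepare(GROUP_B))
    return "A>B" if a > b else "A<=B"


universes = enumerate_universes(DECISIONS)
results = {tuple(u.values()): run_universe(u) for u in universes}

# Treat one specific combination as the "published" analysis.
baseline = results[("keep", "mean", "raw")]
agreeing = sum(1 for r in results.values() if r == baseline)

print(f"{agreeing} of {len(universes)} universes match the baseline conclusion")
```

The number of universes is the product of the option counts per decision (here 2 × 2 × 2 = 8), which is why even a handful of decisions quickly yields thousands of pipelines. In this toy setup the baseline conclusion depends on a single extreme value, so most alternative pipelines disagree with it.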
