Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. As a result, the model attributes the missing variables' effect to some of the included variables, and hence over- or underestimates the latter's true effect. Omitted variable bias presents a significant threat to the validity of empirical research, particularly in non-experimental studies such as those prevalent in empirical software engineering. This paper illustrates the impact of omitted variable bias on two case studies in the software engineering domain, and uses them to present methods to investigate the possible presence of omitted variable bias, to estimate its impact, and to mitigate its drawbacks. The analysis techniques we present are based on causal structural models of the variables of interest, which provide a practical, intuitive summary of the key relations among variables. This paper demonstrates a sequence of analysis steps that can inform the design and execution of any empirical study in software engineering. An important observation is that it pays off to invest effort in investigating omitted variable bias before actually executing an empirical study, because this effort can lead to a more solid study design and to a significant reduction in its threats to validity.
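
To make the mechanism concrete, the following minimal sketch simulates a linear setting in which an outcome `y` depends on an included variable `x` and on a second variable `z` that also influences `x`. The variable names, the coefficients (0.8, 1.0, 2.0), and the sample size are illustrative assumptions and are not taken from the paper's case studies. The sketch fits one model that omits `z` and one that includes it, using only the Python standard library.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equally long sequences."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def simulate(n=20000, seed=0):
    """Return the estimated effect of x on y without and with z in the model.

    Assumed (hypothetical) data-generating process:
        z ~ N(0, 1)
        x = 0.8 * z + noise
        y = 1.0 * x + 2.0 * z + noise    (true effect of x on y is 1.0)
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [0.8 * zi + rng.gauss(0, 1) for zi in z]
    y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

    # Short model y ~ x: z is omitted, so its effect is attributed to x.
    slope_short = cov(x, y) / cov(x, x)

    # Full model y ~ x + z: solve the 2x2 normal equations on centered data.
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    slope_full = (sxy * szz - szy * sxz) / det

    return slope_short, slope_full


if __name__ == "__main__":
    short, full = simulate()
    print(f"effect of x, z omitted:  {short:.3f}")
    print(f"effect of x, z included: {full:.3f}")
```

Under these assumptions, the estimate from the model that omits `z` should land well above the true value of 1.0, because `x` absorbs part of the effect of `z`, while the estimate from the full model should stay close to 1.0. In a real study the omitted variable is typically unobserved, so this kind of direct comparison is not available; the sketch only illustrates why the bias arises.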