Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. The model then attributes the missing variables' effect to some of the included variables, and hence over- or underestimates the true effect of the latter. Omitted variable bias is a significant threat to the validity of empirical research, particularly in non-experimental studies such as those prevalent in empirical software engineering. This paper illustrates the impact of omitted variable bias on two examples from the software engineering domain, and uses them to present methods to investigate the possible presence of omitted variable bias, to estimate its impact, and to mitigate its drawbacks. The analysis techniques we present are based on causal structural models of the variables of interest, which provide a practical, intuitive summary of the key relations among variables. This paper demonstrates a sequence of analysis steps that can inform the design and execution of any empirical study in software engineering. An important observation is that it pays off to invest effort in investigating omitted variable bias before actually executing an empirical study, because this effort can lead to a more solid study design and to a significant reduction in its threats to validity.
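
The mechanism described above can be made concrete with a small simulation. The following is a minimal sketch, not one of the paper's two examples: the variable names, the linear data-generating process, and the coefficients (`beta`, `gamma`, `delta`) are all assumptions chosen for illustration. A variable `z` influences both the predictor `x` and the outcome `y`; regressing `y` on `x` alone folds part of the effect of `z` into the slope of `x`, which in expectation becomes `beta + gamma * cov(x, z) / var(x)` (about 1.98 with the values below) instead of the true `beta = 1.0`.

```python
import random


def cov(a, b):
    """Population covariance of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def simulate(n=20000, beta=1.0, gamma=2.0, delta=0.8, seed=0):
    """Generate data where z drives both x and y (hypothetical coefficients)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [delta * zi + rng.gauss(0, 1) for zi in z]
    y = [beta * xi + gamma * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, y, z


def slope_omitting_z(x, y):
    """OLS slope of y on x alone: z is left out of the model."""
    return cov(x, y) / cov(x, x)


def slope_adjusting_for_z(x, y, z):
    """OLS coefficient of x when y is regressed on both x and z."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


x, y, z = simulate()
naive = slope_omitting_z(x, y)
adjusted = slope_adjusting_for_z(x, y, z)
print(f"slope of x, z omitted:  {naive:.3f}")
print(f"slope of x, z included: {adjusted:.3f}")
```

In this sketch the estimate that omits `z` overstates the effect of `x` by roughly a factor of two, while the estimate that includes `z` recovers a value close to the true coefficient. Real studies rarely have access to the omitted variable, which is why the bias has to be reasoned about from a causal model of the variables rather than simply measured.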