Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address it. Here, we investigate the scenario in which a covariate used in an analysis has missing values and will be imputed. It is commonly recommended that the outcome from the analysis model be included in the imputation model for missing covariates, but it is not always clear whether this recommendation holds in every setting, or why it holds when it does. We examine deterministic imputation (i.e., single imputation where the imputed values are treated as fixed) and stochastic imputation (i.e., single imputation with a random value, or multiple imputation) and their implications for estimating the relationship between the imputed covariate and the outcome. We demonstrate mathematically that, for stochastic imputation methods, including the outcome variable in the imputation model is not just a recommendation but a requirement for unbiased results. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models. This paper aims to bridge the gap between imputation in theory and in practice by providing mathematical derivations that explain common statistical recommendations. We offer a better understanding of the considerations involved in imputing missing covariates and emphasize when it is necessary to include the outcome variable in the imputation model.
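The contrast between deterministic and stochastic imputation, with and without the outcome, can be illustrated with a small simulation. The sketch below is not the paper's derivation. It assumes a hypothetical setup chosen only for illustration: a single standard normal covariate `x`, a linear outcome `y = beta * x + noise` with `beta = 2`, half of the `x` values missing completely at random, and imputation models fitted on the complete cases. The function names `ols` and `simulate` are illustrative, and the stochastic variants draw from the fitted model without propagating parameter uncertainty, which a proper multiple imputation procedure would do.

```python
import random


def ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(n=100_000, beta=2.0, p_missing=0.5, seed=0):
    """Estimate the slope of y on x under four single-imputation strategies."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * xi + rng.gauss(0, 1) for xi in x]
    # Assumption: x is missing completely at random.
    missing = [rng.random() < p_missing for _ in range(n)]

    # Fit the imputation models on complete cases only.
    x_obs = [xi for xi, m in zip(x, missing) if not m]
    y_obs = [yi for yi, m in zip(y, missing) if not m]
    k = len(x_obs)
    mean_x = sum(x_obs) / k
    sd_x = (sum((xi - mean_x) ** 2 for xi in x_obs) / (k - 1)) ** 0.5
    a, b = ols(y_obs, x_obs)  # regression of x on y: x ~ a + b * y
    resid_sd = (
        sum((xi - a - b * yi) ** 2 for xi, yi in zip(x_obs, y_obs)) / (k - 2)
    ) ** 0.5

    imputers = {
        # Deterministic: impute a fixed predicted value.
        "deterministic, no outcome": lambda yi: mean_x,
        "deterministic, with outcome": lambda yi: a + b * yi,
        # Stochastic: impute a random draw around the predicted value.
        "stochastic, no outcome": lambda yi: rng.gauss(mean_x, sd_x),
        "stochastic, with outcome": lambda yi: rng.gauss(a + b * yi, resid_sd),
    }

    slopes = {}
    for name, impute in imputers.items():
        x_imp = [impute(yi) if m else xi for xi, yi, m in zip(x, y, missing)]
        slopes[name] = ols(x_imp, y)[1]  # analysis model: y ~ imputed x
    return slopes


if __name__ == "__main__":
    for name, slope in simulate().items():
        print(f"{name:30s} slope = {slope:.3f}")
```

Under these assumptions, the estimated slope should stay close to the true value of 2 for deterministic imputation without the outcome and for stochastic imputation with the outcome. Stochastic imputation without the outcome should pull it toward zero, and deterministic imputation with the outcome should push it away from zero, which matches the pattern described above.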