Censored, missing, and error-prone covariates are all coarsened data types for which the true values are unknown. Many methods for handling the unobserved values, including imputation, are shared across these data types, with nuances that depend on the mechanism driving the unobserved values and on any other available information. For example, in prospective studies, the time to a specific disease diagnosis will be incompletely observed if only some patients are diagnosed by the end of follow-up. Specifically, some times will be randomly right-censored, and each patient's disease-free follow-up time must be incorporated into the imputed value. Assuming noninformative censoring, these censored values are replaced with their conditional means, whose calculation requires (i) estimating the conditional distribution of the censored covariate and (ii) integrating over the corresponding survival function. Semiparametric approaches are common; they estimate the distribution with a Cox proportional hazards model and approximate the integral with the trapezoidal rule. While these approaches offer robustness, they come at the cost of statistical and computational efficiency. We propose a general framework for parametric conditional mean imputation of censored covariates that offers better statistical precision and a lighter computational burden by modeling the survival function parametrically, in which case the conditional mean often has an analytic solution. The framework is implemented in the open-source R package speedyCMI.
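To make steps (i) and (ii) concrete: for a covariate X right-censored at c with fully observed covariate z, the conditional mean is E[X | X > c, z] = c + (1 / S(c | z)) ∫ from c to ∞ of S(x | z) dx, where S is the conditional survival function. The Python sketch below is an illustration only and is not the speedyCMI implementation. It assumes an exponential model whose rate depends on z through hypothetical coefficients `beta0` and `beta1`, a case in which the integral has the closed form c + 1/rate. It then compares that analytic value with a trapezoidal-rule approximation truncated at a finite upper limit.

```python
import math


def exp_survival(x, rate):
    """Survival function S(x) = exp(-rate * x) of an exponential distribution."""
    return math.exp(-rate * x)


def cmi_analytic(c, rate):
    """Conditional mean E[X | X > c] under the exponential model.

    By memorylessness the integral of S over (c, inf) divided by S(c)
    equals 1 / rate, so no numerical integration is needed.
    """
    return c + 1.0 / rate


def cmi_trapezoid(c, rate, upper, n_steps=2000):
    """Same conditional mean, with the integral of S over (c, upper)
    approximated by the composite trapezoidal rule on an even grid."""
    h = (upper - c) / n_steps
    s = [exp_survival(c + i * h, rate) for i in range(n_steps + 1)]
    area = h * (0.5 * s[0] + sum(s[1:-1]) + 0.5 * s[-1])
    return c + area / s[0]  # s[0] is S(c)


# Hypothetical fitted coefficients and covariate value (assumptions for illustration).
beta0, beta1, z = -1.0, 0.5, 1.0
rate = math.exp(beta0 + beta1 * z)

c = 2.0  # censoring time, e.g. disease-free follow-up
analytic = cmi_analytic(c, rate)
numeric = cmi_trapezoid(c, rate, upper=c + 50.0 / rate)

print(f"analytic imputed value:    {analytic:.4f}")
print(f"trapezoidal imputed value: {numeric:.4f}")
```

The analytic version costs one arithmetic operation per censored observation, while the trapezoidal version evaluates the survival function at every grid point and also needs an upper truncation limit to be chosen. Other parametric families may call for special functions rather than a one-line formula.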