Motivated by multi-center biomedical studies that cannot share individual data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, even with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. In addition, we provide valid and powerful distributed hypothesis tests for any individual coordinate of the coefficient vector, based on a decorrelated score test. The framework accommodates time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods.