Motivated by multi-center biomedical studies that cannot share individual-level data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, even with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under very mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. In addition, we provide valid and powerful distributed hypothesis tests for any individual coefficient based on a decorrelated score test. We allow time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods.