With the growing availability of large-scale biomedical data, it is often time-consuming or infeasible to perform traditional statistical analysis directly with the relatively limited computing resources at hand. We propose a fast subsampling method that effectively approximates the full-data maximum partial likelihood estimator in Cox's model and substantially reduces the computational burden of analyzing massive survival data. We establish consistency and asymptotic normality of a general subsample-based estimator. The optimal subsampling probabilities, which have explicit expressions, are determined by minimizing the trace of the asymptotic variance-covariance matrix of a linearly transformed parameter estimator. For practical implementation, we propose a two-step subsampling algorithm that significantly reduces computing time compared with the full-data method. The asymptotic properties of the resulting two-step subsample-based estimator are also established. Extensive numerical experiments and a real-world example are provided to assess our subsampling strategy.
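As a rough illustration of the two-step idea summarized above, the following Python sketch draws a uniform pilot subsample, fits a pilot estimate, turns it into non-uniform sampling probabilities, and then fits an inverse-probability-weighted Cox model on a second subsample. It is a sketch under stated assumptions, not the exact algorithm of the paper: the function names `weighted_cox_fit` and `two_step_subsample_fit` are hypothetical, ties are handled with a Breslow-type risk set, the probabilities are taken proportional to the norm of pilot-based score residuals (an assumption of this sketch, since the abstract gives no formula), they are mixed with a uniform component `mix` for stability, and only the second-step subsample enters the final fit.

```python
import numpy as np

def weighted_cox_fit(t, d, X, w, n_iter=50, tol=1e-8):
    """Newton-Raphson for a weighted (Breslow-type) log partial likelihood."""
    w = w / w.mean()                                    # rescaling leaves the maximizer unchanged
    at_risk = (t[None, :] >= t[:, None]).astype(float)  # at_risk[i, j] = 1 if t_j >= t_i
    wd = w * d                                          # weight carried by each observed event
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        risk = w * np.exp(X @ beta)                     # weighted relative risks
        s0 = at_risk @ risk                             # risk-set totals
        xbar = (at_risk @ (risk[:, None] * X)) / s0[:, None]  # risk-set covariate means
        score = X.T @ wd - xbar.T @ wd
        c = at_risk.T @ (wd / s0)
        info = X.T @ ((risk * c)[:, None] * X) - xbar.T @ (wd[:, None] * xbar)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def two_step_subsample_fit(t, d, X, r0, r, rng, mix=0.2):
    """Pilot fit on r0 uniform draws, then a weighted fit on r non-uniform draws."""
    n = len(t)
    # Step 1: uniform pilot subsample (with replacement) and pilot estimate.
    idx0 = rng.choice(n, size=r0, replace=True)
    tp, dp, Xp = t[idx0], d[idx0], X[idx0]
    beta0 = weighted_cox_fit(tp, dp, Xp, np.ones(r0))
    rp = np.exp(Xp @ beta0)

    def pilot_risk(u):
        # Pilot risk-set total and covariate mean at each time in u.
        A = (tp[None, :] >= u[:, None]).astype(float)
        s0 = A @ rp
        # The mean falls back to 0 where the pilot risk set is empty.
        return s0, (A @ (rp[:, None] * Xp)) / np.maximum(s0, 1e-12)[:, None]

    s0p, xbar_p = pilot_risk(tp)
    dL = dp / s0p                                       # pilot Breslow hazard increments
    _, xbar_full = pilot_risk(t)
    B = (tp[None, :] <= t[:, None]).astype(float)       # B[i, k] = 1 if pilot time k <= t_i
    comp = np.exp(X @ beta0)[:, None] * (X * (B @ dL)[:, None] - B @ (dL[:, None] * xbar_p))
    resid = d[:, None] * (X - xbar_full) - comp         # approximate score residuals
    norms = np.linalg.norm(resid, axis=1)
    # Assumed probabilities: residual norms mixed with a uniform component.
    probs = (1.0 - mix) * norms / norms.sum() + mix / n
    # Step 2: non-uniform subsample, fitted with inverse-probability weights.
    idx = rng.choice(n, size=r, replace=True, p=probs)
    return weighted_cox_fit(t[idx], d[idx], X[idx], 1.0 / probs[idx])

# Small simulated example: exponential event times with hazard exp(x' beta).
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
beta_true = np.array([0.5, -0.5])
event_time = rng.exponential(1.0 / np.exp(X @ beta_true))
censor_time = rng.exponential(2.0, size=n)
t = np.minimum(event_time, censor_time)
d = (event_time <= censor_time).astype(float)
beta_hat = two_step_subsample_fit(t, d, X, r0=300, r=1000, rng=rng)
print(beta_hat)
```

The dense risk-set matrices keep the sketch short but cost memory quadratic in the subsample size, which is acceptable only because the subsamples are small relative to the full data. A production implementation would sort by time and use cumulative sums instead.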