Random survival forests are widely used for estimating covariate-conditional survival functions under right-censoring. Their standard log-rank splitting criterion is typically recomputed for each candidate split, at a cost of O(M) per split, where M is the number of distinct event times in the node. This cost creates a bottleneck for large cohort datasets with long follow-up. We revisit approximations proposed by LeBlanc and Crowley (1995) and develop simple constant-time updates for the log-rank criterion. The method is implemented in the grf package for R and reduces training time on large datasets while preserving predictive accuracy.
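To make the O(M) cost concrete, the following is a minimal sketch of the textbook two-sample log-rank statistic evaluated for one candidate split. It is not the grf implementation and not the constant-time update described in the abstract. The function name `logrank_statistic` and its arguments are hypothetical, chosen only for illustration. After a counting pass over the samples, the final loop visits each of the M distinct event times once, and that loop is the per-split work the abstract refers to.

```python
from bisect import bisect_right
from math import sqrt


def logrank_statistic(times, events, in_left):
    """Standardized log-rank statistic for a candidate split (illustrative sketch).

    times   : observed time for each sample (event or censoring time)
    events  : 1 if the event was observed, 0 if the sample was censored
    in_left : True if the sample falls in the left child of the split
    """
    # Distinct event times in the node; M = len(event_times).
    event_times = sorted({t for t, e in zip(times, events) if e})
    m = len(event_times)

    deaths = [0] * m        # events at each distinct event time
    deaths_left = [0] * m   # ... of which in the left child
    last = [0] * m          # samples whose last at-risk event time has index j
    last_left = [0] * m     # ... of which in the left child

    for t, e, left in zip(times, events, in_left):
        k = bisect_right(event_times, t) - 1
        if k < 0:
            continue  # censored before the first event time: never at risk
        last[k] += 1
        last_left[k] += int(left)
        if e:
            deaths[k] += 1
            deaths_left[k] += int(left)

    # Walk the event times from latest to earliest so the risk sets accumulate.
    numerator = 0.0
    variance = 0.0
    at_risk = 0
    at_risk_left = 0
    for j in range(m - 1, -1, -1):  # O(M) loop
        at_risk += last[j]
        at_risk_left += last_left[j]
        # Observed minus expected events in the left child.
        numerator += deaths_left[j] - at_risk_left * deaths[j] / at_risk
        if at_risk > 1:
            frac = at_risk_left / at_risk
            variance += (frac * (1 - frac)
                         * (at_risk - deaths[j]) / (at_risk - 1) * deaths[j])

    return numerator / sqrt(variance) if variance > 0 else 0.0
```

A tree builder would typically evaluate the absolute value of this statistic for every candidate split and keep the largest, which is why the repeated O(M) loop dominates when follow-up is long and M is large.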