A neutral comparison of statistical methods for time-to-event analyses under non-proportional hazards

Florian Klinglmüller, Tobias Fellinger, Franz König, Tim Friede, Andrew C. Hooker, Harald Heinzl, Martina Mittlböck, Jonas Brugger, Maximilian Bardo, Cynthia Huber, Norbert Benda, Martin Posch, Robin Ristl

While well-established methods for time-to-event data are available when the proportional hazards assumption holds, there is no consensus on the best inferential approach under non-proportional hazards (NPH). However, a wide range of parametric and non-parametric methods for testing and estimation in this scenario have been proposed. To provide recommendations on the statistical analysis of clinical trials where non-proportional hazards are expected, we conducted a comprehensive simulation study under different scenarios of non-proportional hazards, including delayed onset of treatment effect, crossing hazard curves, subgroups with different treatment effects, and changing hazards after disease progression. We assessed type I error rate control, power, and, where applicable, confidence interval coverage for a wide range of methods, including weighted log-rank tests, the MaxCombo test, summary measures such as the restricted mean survival time (RMST), average hazard ratios, and milestone survival probabilities, as well as accelerated failure time regression models. We found a trade-off between interpretability and power when choosing an analysis strategy under NPH scenarios. While analysis methods based on weighted log-rank tests were typically favorable in terms of power, they do not provide an easily interpretable treatment effect estimate. Also, depending on the weight function, they test a narrow null hypothesis of equal hazard functions, and rejection of this null hypothesis may not allow a direct conclusion of treatment benefit in terms of the survival function. In contrast, non-parametric procedures based on readily interpretable measures such as the RMST difference had lower power in most scenarios. Model-based methods relying on specific survival distributions had larger power, but often gave biased estimates and lower than nominal confidence interval coverage.
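
To make one of the summary measures named above concrete, the following is a minimal sketch of how an RMST difference could be computed in a delayed-onset scenario. It is not the simulation code of the study: the helper names (`kaplan_meier`, `rmst`, `simulate_delayed_effect`) and all parameter values (`lam`, `hr`, `delay`, `tau`, the sample size and the censoring time) are hypothetical and chosen only for illustration. The RMST is taken as the area under the Kaplan-Meier curve up to a truncation time `tau`.

```python
import numpy as np


def kaplan_meier(time, event):
    """Distinct event times and the Kaplan-Meier survival estimate at each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    # Number at risk just before each event time, and number of events at it.
    at_risk = np.array([(time >= t).sum() for t in event_times])
    deaths = np.array([((time == t) & event).sum() for t in event_times])
    surv = np.cumprod(1.0 - deaths / at_risk)
    return event_times, surv


def rmst(time, event, tau):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    t, s = kaplan_meier(time, event)
    keep = t < tau
    t, s = t[keep], s[keep]
    grid = np.concatenate(([0.0], t, [tau]))  # step boundaries
    heights = np.concatenate(([1.0], s))      # survival level on each step
    return float(np.sum(np.diff(grid) * heights))


def simulate_delayed_effect(n, lam, hr, delay, rng):
    """Event times with hazard lam before `delay` and lam * hr afterwards.

    Uses inversion of the piecewise-linear cumulative hazard.
    """
    e = rng.exponential(1.0, size=n)
    return np.where(e < lam * delay,
                    e / lam,
                    delay + (e - lam * delay) / (lam * hr))


# Illustrative (hypothetical) settings, not taken from the paper.
rng = np.random.default_rng(1)
n, lam, hr, delay, tau, censor_at = 500, 0.5, 0.6, 1.0, 4.0, 5.0

t_control = rng.exponential(1.0 / lam, size=n)
t_treat = simulate_delayed_effect(n, lam, hr, delay, rng)

# Administrative censoring at a fixed calendar time.
e_control, e_treat = t_control <= censor_at, t_treat <= censor_at
t_control = np.minimum(t_control, censor_at)
t_treat = np.minimum(t_treat, censor_at)

rmst_diff = rmst(t_treat, e_treat, tau) - rmst(t_control, e_control, tau)
print(f"Estimated RMST difference up to tau={tau}: {rmst_diff:.3f}")
```

The resulting difference is on the time scale (mean event-free time gained up to `tau`), which is the kind of interpretability the abstract contrasts with weighted log-rank tests. Inference (standard errors, confidence intervals) is deliberately omitted from this sketch.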
