Estimating risks or survival probabilities conditional on individual characteristics from censored time-to-event data is a common task. It may be done to develop a prediction model or as part of a wider estimation procedure, such as in causal inference. A challenge is that it is impossible to know at the outset which of a set of candidate models will provide the best risk estimates. The super learner is a powerful approach for finding the best model or combination of models ('ensemble') among a pre-specified set of candidate models or 'learners', which can include both 'statistical' models (e.g. parametric and semi-parametric models) and 'machine learning' models. Super learners for time-to-event outcomes have been developed, but the literature is technical, and the full details of how these methods work and how they can be implemented in practice have not previously been presented in an accessible format. In this paper we provide a practical tutorial on super learner methods for time-to-event outcomes. We give an overview of the general steps involved in the super learner, followed by details of three specific implementations for time-to-event outcomes: the originally proposed super learner, which uses a discrete time scale, and two more recently proposed versions for continuous-time data. We compare the properties of the methods and describe how they can be implemented in R. The methods are illustrated using an open-access data set, and R code is provided.