In survival analysis, complex machine learning algorithms have been increasingly used for predictive modeling. Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. In particular, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of infection over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to infection are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 study to inform enrollment strategies for future HIV vaccine trials.