In survival analysis, complex machine learning algorithms are increasingly used for predictive modeling. Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of those features for the prediction task at hand. In particular, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of infection over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute to overall predictiveness. Time-to-event outcomes such as time to infection are often subject to right censoring, and existing methods for assessing variable importance are typically not designed for this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and is doubly robust. We assess the performance of the proposed procedure via numerical simulations and analyze data from the HVTN 702 study to inform enrollment strategies for future HIV vaccine trials.