Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. For example, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of HIV acquisition over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to HIV acquisition are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 vaccine trial to inform enrollment strategies for future HIV vaccine trials.