In clinical prediction settings, the importance of a high-dimensional feature such as genomics is often assessed by the change in predictive performance when it is added to a set of traditional clinical variables. This approach is questionable because it accounts for neither collinearity nor the known directionality of dependencies between variables. We propose asymmetric Shapley values as a more suitable way to quantify feature importance in a mixed-dimensional prediction model. We focus on a setting that is particularly relevant in clinical prediction: disease state acts as a mediating variable for genomic effects, and there are additional confounders for which the direction of effects may be unknown. We derive efficient algorithms to compute local and global asymmetric Shapley values for this setting. The former are shown to be useful for inference, whereas the latter aid interpretation by decomposing any predictive performance metric into contributions of the features. Throughout, we illustrate our framework with a running example: the prediction of progression-free survival for colorectal cancer patients.
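To make the idea of asymmetric Shapley values concrete, the following minimal sketch averages marginal contributions only over feature orderings that respect a given precedence constraint (here, genomics entering before disease state). It is a brute-force illustration, not the efficient algorithms derived in this work. The function name `asymmetric_shapley`, the feature names, and the numbers in `toy` are all hypothetical and chosen purely for illustration.

```python
from itertools import permutations


def asymmetric_shapley(features, value, precedes):
    """Average marginal contributions over orderings that respect `precedes`.

    features: list of feature names
    value:    function mapping a frozenset of features to a number
              (for example, a predictive performance metric)
    precedes: set of (a, b) pairs meaning a must enter before b
    """
    def admissible(order):
        pos = {f: i for i, f in enumerate(order)}
        return all(pos[a] < pos[b] for a, b in precedes)

    # Brute force: enumerate all orderings, keep the admissible ones.
    orders = [o for o in permutations(features) if admissible(o)]
    if not orders:
        raise ValueError("no ordering satisfies the precedence constraints")

    phi = {f: 0.0 for f in features}
    for order in orders:
        seen = frozenset()
        for f in order:
            phi[f] += value(seen | {f}) - value(seen)
            seen = seen | {f}
    return {f: total / len(orders) for f, total in phi.items()}


# Hypothetical performance values for each feature subset (made-up numbers).
features = ["genomics", "disease_state", "age"]
toy = {
    frozenset(): 0.0,
    frozenset({"genomics"}): 0.3,
    frozenset({"disease_state"}): 0.3,
    frozenset({"age"}): 0.1,
    frozenset({"genomics", "disease_state"}): 0.35,
    frozenset({"genomics", "age"}): 0.4,
    frozenset({"disease_state", "age"}): 0.4,
    frozenset({"genomics", "disease_state", "age"}): 0.45,
}


def value(subset):
    return toy[frozenset(subset)]


# No ordering constraint: ordinary (symmetric) Shapley values.
sym = asymmetric_shapley(features, value, precedes=set())
# Assumed causal ordering: genomics precedes disease state.
asym = asymmetric_shapley(
    features, value, precedes={("genomics", "disease_state")}
)
print(sym)
print(asym)
```

In this toy example, genomics and disease state carry largely overlapping information. Without a constraint, the two split their joint contribution equally. With the ordering imposed, the shared part is attributed to genomics, and disease state is credited only with what it adds beyond genomics. In both cases the attributions sum to the performance of the full model, which is the sense in which a performance metric is decomposed into feature contributions.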