Computing the Hazard Ratios Associated with Explanatory Variables Using Machine Learning Models of Survival Data

Purpose: The application of Cox Proportional Hazards (CoxPH) models to survival data and the derivation of hazard ratios (HRs) are well established. Although nonlinear, tree-based machine learning (ML) models have been developed and applied to survival analysis, no methodology exists for computing HRs associated with explanatory variables from such models. We describe a novel way to compute HRs from tree-based ML models using Shapley additive explanation (SHAP) values, a locally accurate and consistent method for quantifying each explanatory variable's contribution to a prediction. Methods: We used three publicly available survival datasets (colon cancer, breast cancer, and pan-cancer patients) and compared the performance of CoxPH with that of a state-of-the-art ML model, XGBoost. To compute the HR for an explanatory variable from the XGBoost model, the SHAP values were exponentiated and the ratio of their means over the two subgroups being compared was calculated. The confidence interval was computed by bootstrapping the training data and refitting the ML model 1000 times. Across the three datasets, we systematically compared HRs for all explanatory variables. Open-source libraries in Python and R were used in the analyses. Results: For the colon and breast cancer datasets, the performance of CoxPH and XGBoost was comparable, and the computed HRs were consistent between the two models. In the pan-cancer dataset, the two models agreed for most variables but gave opposite findings for two explanatory variables. Subsequent Kaplan-Meier plots supported the findings of the XGBoost model. Conclusion: Enabling the derivation of HRs from ML models can help improve the identification of risk factors in complex survival datasets and enhance the prediction of clinical trial outcomes.
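The HR computation outlined in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `hazard_ratio_from_shap` and `bootstrap_hr_ci`, the user-supplied callback `fit_and_explain`, the binary coding of the variable (subgroup membership is `== 1`), and the percentile form of the bootstrap interval are all hypothetical choices made here. The sketch also assumes that the SHAP values are on a log-hazard scale, so that exponentiating them yields a multiplicative effect on the hazard.

```python
import numpy as np


def hazard_ratio_from_shap(shap_col, in_group):
    """Ratio of mean exp(SHAP) in the subgroup to mean exp(SHAP) in the rest.

    shap_col: SHAP values of one explanatory variable, one per patient.
    in_group: boolean mask marking the patients in the subgroup of interest.
    """
    shap_col = np.asarray(shap_col, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    if in_group.all() or not in_group.any():
        raise ValueError("both subgroups must be non-empty")
    num = np.exp(shap_col[in_group]).mean()
    den = np.exp(shap_col[~in_group]).mean()
    return float(num / den)


def bootstrap_hr_ci(X, y, feature, fit_and_explain,
                    n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the SHAP-based HR (an assumption).

    fit_and_explain(X_boot, y_boot) is a hypothetical callback that refits
    the model on the resampled data and returns an array of SHAP values
    with the same shape as X_boot.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    hrs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        Xb, yb = X[idx], y[idx]
        in_group = Xb[:, feature] == 1        # assumes a binary variable
        if in_group.all() or not in_group.any():
            continue                          # skip degenerate resamples
        shap_b = np.asarray(fit_and_explain(Xb, yb))
        hrs.append(hazard_ratio_from_shap(shap_b[:, feature], in_group))
    lo, hi = np.quantile(hrs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

In practice the callback would wrap the model fit and the SHAP computation (for example, an XGBoost survival model and a tree explainer); it is left abstract here so that the sketch depends only on NumPy.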
