Comparative Evaluation of Machine Learning Models for Predicting Donor Kidney Discard

A kidney transplant can improve the life expectancy and quality of life of patients with end-stage renal failure. Even more patients could receive a transplant if fewer donor kidneys were discarded rather than transplanted. Machine learning (ML) can support decision-making in this context by identifying donor organs at high risk of discard early, for instance to enable timely interventions that improve organ utilization, such as rescue allocation. Although various ML models have been applied to this problem, their results are difficult to compare because of heterogeneous datasets and differences in feature engineering and evaluation strategies. This study aims to provide a systematic and reproducible comparison of ML models for donor kidney discard prediction. We trained five commonly used ML models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Deep Learning) along with an ensemble model on data from 4,080 deceased donors in Germany whose death was determined by neurologic criteria. A unified benchmarking framework was implemented, including standardized feature engineering and selection as well as Bayesian hyperparameter optimization. Models were assessed for discrimination (MCC, AUC, F1), calibration (Brier score), and explainability (SHAP). The ensemble achieved the highest discrimination performance (MCC=0.76, AUC=0.87, F1=0.90), while individual models such as Logistic Regression, Random Forest, and Deep Learning performed comparably and outperformed the Decision Tree. Platt scaling improved calibration for tree-based and neural-network-based models. SHAP consistently identified donor age and renal markers as the dominant predictors across models, which is clinically plausible. This study demonstrates that consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of ML algorithm.
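To make the evaluation measures named above concrete, the following minimal sketch computes MCC, F1, AUC, and the Brier score, and applies a simplified form of Platt scaling. It is an illustration only and not the study's benchmarking code. The labels and probabilities are invented toy values (1 = discarded), all function names are hypothetical, the 0.5 decision threshold is an assumption, AUC is taken here to mean the area under the ROC curve, and the calibrator is fitted by plain gradient descent on the logit of the predicted probability, on the same toy data it is evaluated on. In practice a calibrator would normally be fitted on held-out data.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = discarded."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 if any marginal count is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

def log_loss(y_true, y_prob):
    """Mean negative log-likelihood of the observed outcomes."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob))
    return -total / len(y_true)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def fit_platt(scores, y_true, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    Simplification: starts from the identity map (a=1, b=0) and omits the
    target smoothing used in Platt's original formulation.
    """
    a, b = 1.0, 0.0
    n = len(y_true)
    for _ in range(steps):
        probs = [sigmoid(a * s + b) for s in scores]
        grad_a = sum((p - t) * s for p, t, s in zip(probs, y_true, scores)) / n
        grad_b = sum(p - t for p, t in zip(probs, y_true)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Invented toy data: true outcomes and uncalibrated predicted discard risks.
y_true = [1, 1, 1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
y_pred = [int(p >= 0.5) for p in y_prob]  # assumed threshold of 0.5

# Calibrate on the logit scale, then map back to probabilities.
scores = [logit(p) for p in y_prob]
a, b = fit_platt(scores, y_true)
y_cal = [sigmoid(a * s + b) for s in scores]

print("MCC:", mcc(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", auc(y_true, y_prob))
print("Brier (raw):", brier_score(y_true, y_prob))
print("Brier (calibrated):", brier_score(y_true, y_cal))
```

Because the calibration map is monotone whenever the fitted slope `a` is positive, it changes the Brier score and log loss but leaves rank-based discrimination such as AUC unchanged, which is why calibration and discrimination are reported as separate criteria.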
