Comparative Evaluation of Machine Learning Models for Predicting Donor Kidney Discard

A kidney transplant can improve the life expectancy and quality of life of patients with end-stage renal failure. More patients could receive a transplant if fewer donor kidneys were discarded rather than transplanted. Machine learning (ML) can support decision-making in this context by identifying donor organs at high risk of discard early, for instance to enable timely interventions that improve organ utilization, such as rescue allocation. Although various ML models have been applied, their results are difficult to compare because of heterogeneous datasets and differences in feature engineering and evaluation strategies. This study aims to provide a systematic and reproducible comparison of ML models for donor kidney discard prediction. We trained five commonly used ML models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Deep Learning), along with an ensemble model, on data from 4,080 deceased donors (death determined by neurologic criteria) in Germany. A unified benchmarking framework was implemented, including standardized feature engineering and selection as well as Bayesian hyperparameter optimization. Model performance was assessed for discrimination (MCC, AUC, F1), calibration (Brier score), and explainability (SHAP). The ensemble achieved the highest discrimination performance (MCC=0.76, AUC=0.87, F1=0.90), while individual models such as Logistic Regression, Random Forest, and Deep Learning performed comparably and outperformed the Decision Tree. Platt scaling improved calibration for tree-based and neural-network-based models. SHAP consistently identified donor age and renal markers as the dominant predictors across models, which is clinically plausible. This study demonstrates that consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of ML algorithm.
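The discrimination and calibration metrics named above can be illustrated with a minimal plain-Python sketch. It is not the study's evaluation code: the labels and probabilities are invented toy values, the 0.5 decision threshold is an assumption, "AUC" is assumed to mean area under the ROC curve, and all function names are hypothetical.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = discarded."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 if any marginal count is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1(y_true, y_pred):
    """F1 score, the harmonic mean of precision and recall."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc(y_true, y_prob):
    """ROC AUC as the probability that a random positive outranks a
    random negative (ties count as one half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, y_prob):
    """Brier score: mean squared error of predicted probabilities."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

# Toy data only (NOT from the study): 1 = kidney discarded, 0 = transplanted.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
THRESHOLD = 0.5  # assumed cut-off for turning probabilities into labels
y_pred = [1 if p >= THRESHOLD else 0 for p in y_prob]

print("MCC  ", mcc(y_true, y_pred))
print("F1   ", f1(y_true, y_pred))
print("AUC  ", roc_auc(y_true, y_prob))
print("Brier", brier(y_true, y_prob))
```

MCC and F1 depend on the chosen threshold, whereas AUC and the Brier score are computed directly from the predicted probabilities, which is why calibration methods such as Platt scaling can change the Brier score without necessarily changing the ranking-based AUC.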
