Clinical prediction models are increasingly used to support patient care, yet many deep learning-based approaches remain unstable: their predictions can vary substantially when trained on different samples from the same population. Such instability undermines reliability and limits clinical adoption. In this study, we propose a novel bootstrapping-based regularisation framework that embeds the bootstrapping process directly into the training of deep neural networks. This approach constrains prediction variability across resampled datasets, producing a single model with inherent stability properties. We evaluated models built with the proposed regularisation approach against conventional and ensemble models on simulated data and three clinical datasets: GUSTO-I, Framingham, and SUPPORT. Across all datasets, our model exhibited improved prediction stability, with lower mean absolute differences (e.g., 0.019 vs. 0.059 in GUSTO-I; 0.057 vs. 0.088 in Framingham) and markedly fewer significantly deviating predictions. Importantly, discriminative performance and feature importance consistency were maintained, with high SHAP correlations between models (e.g., 0.894 for GUSTO-I; 0.965 for Framingham). While ensemble models achieved greater stability, we show that this came at the expense of interpretability, as each constituent model used predictors in different ways. By regularising predictions to align with bootstrapped distributions, our approach enables the development of prediction models that achieve greater robustness and reproducibility without sacrificing interpretability. This method provides a practical route toward more reliable and clinically trustworthy deep learning models, particularly valuable in data-limited healthcare settings.
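The core idea, regularising a single model's predictions toward a bootstrapped prediction distribution, can be sketched in minimal form. The sketch below is illustrative only: the choice of a logistic model, a squared-error penalty pulling predictions toward the bootstrap-mean prediction, and the penalty weight `lam` are assumptions for exposition, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=200, anchor=None, lam=0.0):
    """Gradient-descent logistic regression. If `anchor` (per-sample
    reference probabilities) is given, a stability penalty
    lam * mean((p - anchor)^2) is added to the cross-entropy loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(y)
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad_logit = (p - y) / n                      # BCE gradient wrt logits
        if anchor is not None:
            # d/dlogit of (p - anchor)^2 is 2*(p - anchor)*p*(1 - p)
            grad_logit += lam * 2.0 * (p - anchor) * p * (1.0 - p) / n
        w -= lr * (X.T @ grad_logit)
        b -= lr * grad_logit.sum()
    return w, b

# Toy binary-outcome data (purely synthetic, for illustration).
n, d = 400, 3
X = rng.normal(size=(n, d))
y = (sigmoid(X @ np.array([1.0, -1.0, 0.5])) > rng.random(n)).astype(float)

# Step 1: fit one plain model per bootstrap resample and average their
# predictions on the full data -> a bootstrapped reference distribution.
B = 20
boot_preds = []
for _ in range(B):
    idx = rng.integers(0, n, n)           # sample n rows with replacement
    wb, bb = fit_logistic(X[idx], y[idx])
    boot_preds.append(sigmoid(X @ wb + bb))
anchor = np.mean(boot_preds, axis=0)

# Step 2: train a single model whose predictions are regularised toward
# the bootstrapped mean, yielding one model with the stability baked in.
w, b = fit_logistic(X, y, anchor=anchor, lam=1.0)
pred = sigmoid(X @ w + b)
```

In the full method the anchor distribution would come from the embedded bootstrapping during deep network training rather than from pre-fitted linear models, but the structure of the penalty term is the same: deviations from the resampled prediction distribution are discouraged while the primary loss still drives discrimination.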