Compatibility of Missing Data Handling Methods across the Stages of Producing Clinical Prediction Models

Missing data is a challenge when developing, validating and deploying clinical prediction models (CPMs). Traditionally, decisions about missing data handling during CPM development and validation have not accounted for whether missingness is allowed at deployment. We hypothesised that the missing data approach used during model development should optimise model performance upon deployment, whilst the approach used during model validation should yield unbiased estimates of predictive performance upon deployment; we term this compatibility. We aimed to determine which combinations of missing data handling methods across the CPM life cycle are compatible. We considered scenarios in which CPMs are intended to be deployed either with or without missing data allowed, and we evaluated the impact of that choice on earlier modelling decisions. Through a simulation study and an empirical analysis of thoracic surgery data, we compared CPMs developed and validated using combinations of complete case analysis, mean imputation, single regression imputation, multiple imputation, and pattern sub-modelling. If a CPM is to be deployed without allowing missing data, then development and validation should use multiple imputation when required. Where missingness is allowed at deployment, the same imputation method must be used during development and validation. Commonly used combinations of missing data handling methods result in biased predictive performance estimates.
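As an illustration of the recommendation that the same imputation method be used during development and validation when missingness is allowed at deployment, the following is a minimal sketch using mean imputation, one of the methods compared. It is not the authors' code. The function names, the toy data, and the choice to carry the development-set means forward to the validation data are all assumptions made for the example.

```python
from statistics import fmean


def fit_mean_imputer(rows):
    """Learn per-column means from development data, skipping missing (None) values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(fmean(observed))
    return means


def apply_imputer(rows, means):
    """Replace each missing value with the corresponding learned column mean."""
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy predictors; None marks a missing value.
development = [[1.0, 10.0], [3.0, None], [None, 30.0]]
validation = [[None, 20.0], [5.0, None]]

# Assumption: the imputer is estimated once, on the development data ...
means = fit_mean_imputer(development)

# ... and the identical imputer is then applied at development and validation,
# mirroring how missing values would be handled at deployment.
dev_filled = apply_imputer(development, means)
val_filled = apply_imputer(validation, means)
```

Under this design, the validation data pass through exactly the same missing data handling that new patients would encounter at deployment, which is the sense of compatibility described above.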

