Deep Learning (DL) methods have dramatically increased in popularity in recent years, with significant growth in their application to supervised learning problems in the biomedical sciences. However, the greater prevalence and complexity of missing data in modern biomedical datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of deeply learned generalized linear models, a supervised DL architecture for regression and classification problems. We propose a new architecture, \textit{dlglm}, that is one of the first able to flexibly account for both ignorable and non-ignorable patterns of missingness in input features and response at training time. We demonstrate through statistical simulation that our method outperforms existing approaches for supervised learning tasks when data are missing not at random (MNAR). We conclude with a case study of a Bank Marketing dataset from the UCI Machine Learning Repository, in which we predict whether clients subscribed to a product based on phone survey data. Supplementary materials for this article are available online.