Deep Learning (DL) methods have dramatically increased in popularity in recent years, with significant growth in their application to supervised learning problems in the biomedical sciences. However, the greater prevalence and complexity of missing data in modern biomedical datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of deeply learned generalized linear models, a supervised DL architecture for regression and classification problems. We propose a new architecture, \textit{dlglm}, that is one of the first able to flexibly account for both ignorable and non-ignorable patterns of missingness in input features and response at training time. We demonstrate through statistical simulation that our method outperforms existing approaches for supervised learning tasks when data are missing not at random (MNAR). We conclude with a case study of a Bank Marketing dataset from the UCI Machine Learning Repository, in which we predict whether clients subscribed to a product based on phone survey data. Supplementary materials for this article are available online.