Vast amounts of health data are continuously collected for each patient, providing opportunities to support diverse healthcare predictive tasks such as seizure detection and hospitalization prediction. Existing models are mostly trained on data from other patients and evaluated on new patients, and many of them may suffer from poor generalizability. One key reason can be overfitting to the unique information related to patient identities and their data collection environments, referred to as patient covariates in this paper. These patient covariates usually do not contribute to predicting the targets but are often difficult to remove. As a result, they can bias the model training process and impede generalization. In healthcare applications, most existing domain generalization methods assume a small number of domains. In this paper, considering the diversity of patient covariates, we propose a new setting that treats each patient as a separate domain (leading to many domains). We develop a new domain generalization method, ManyDG, that can scale to such many-domain problems. Our method identifies the patient domain covariates by mutual reconstruction and removes them via an orthogonal projection step. Extensive experiments show that ManyDG can boost generalization performance on multiple real-world healthcare tasks (e.g., a 3.7% Jaccard improvement on MIMIC drug recommendation) and support realistic but challenging settings such as insufficient data and continuous learning.
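To make the orthogonal projection step concrete, the following is a minimal sketch of removing one direction from a feature vector. It is not the ManyDG implementation: the abstract does not specify how the covariate direction is estimated or in which representation space the projection is applied, so the function name `project_out`, the helper `dot`, and the toy vectors are all hypothetical. The sketch only illustrates the standard identity v_perp = v - (<v, z> / <z, z>) z, which leaves the part of v orthogonal to z.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(feature: Sequence[float],
                covariate: Sequence[float],
                eps: float = 1e-12) -> List[float]:
    """Return the component of `feature` orthogonal to `covariate`.

    Hypothetical helper for illustration only; the covariate vector is
    assumed to be given (in the paper it is identified separately).
    """
    if len(feature) != len(covariate):
        raise ValueError("feature and covariate must have the same length")
    denom = dot(covariate, covariate)
    if denom < eps:
        # A (near-)zero covariate defines no direction; nothing to remove.
        return list(feature)
    scale = dot(feature, covariate) / denom
    return [f - scale * c for f, c in zip(feature, covariate)]


# Toy example: the covariate lies along the first axis, so projecting it
# out zeroes the first coordinate and leaves the others unchanged.
feature = [3.0, 4.0, 1.0]
covariate = [1.0, 0.0, 0.0]
cleaned = project_out(feature, covariate)
```

After the projection, `cleaned` has zero inner product with `covariate`, which is the sense in which the covariate direction has been removed from the feature.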