Self-supervised learning (SSL) for clinical time series data has received significant attention in recent literature, since these data are rich and provide important information about a patient's physiological state. However, most existing SSL methods for clinical time series are designed for unimodal time series, such as a sequence of structured features (e.g., lab values and vital signs) or an individual high-dimensional physiological signal (e.g., an electrocardiogram). These methods cannot be readily extended to multimodal time series, in which both structured features and high-dimensional data are recorded at each timestep in the sequence. In this work, we address this gap and propose a new SSL method, Sequential Multi-Dimensional SSL, in which an SSL loss is applied both at the level of the entire sequence and at the level of the individual high-dimensional data points in the sequence, in order to better capture information at both scales. Our strategy is agnostic to the specific form of loss function used at each level: it can be contrastive, as in SimCLR, or non-contrastive, as in VICReg. We evaluate our method on two real-world clinical datasets, where the time series contains sequences of (1) high-frequency electrocardiograms and (2) structured data from lab values and vital signs. Our experimental results indicate that pre-training with our method and then fine-tuning on downstream tasks improves performance over baselines on both datasets and, in several settings, yields improvements across different self-supervised loss functions.
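To make the two-level idea concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes that each level produces embeddings for two augmented views, that the two losses are combined by a weighted sum, and that a simplified one-directional InfoNCE loss stands in for the contrastive option (SimCLR's full NT-Xent loss is not reproduced here). The names `info_nce`, `two_level_loss`, and `signal_weight` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(view_a, view_b, temperature=0.1):
    """Simplified contrastive loss (assumption: a stand-in for a SimCLR-style loss).

    Row i of view_a is treated as the positive for row i of view_b;
    all other rows of view_b act as negatives.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


def two_level_loss(seq_a, seq_b, sig_a, sig_b,
                   level_loss=info_nce, signal_weight=1.0):
    """Sum an SSL loss over sequence embeddings and over per-timestep signal embeddings.

    seq_a, seq_b: embeddings of two views of each whole sequence.
    sig_a, sig_b: embeddings of two views of each high-dimensional data point.
    level_loss:   any callable taking two views; the same loss is reused at both levels here.
    signal_weight: hypothetical weighting term; the weighted sum is an assumption.
    """
    sequence_term = level_loss(seq_a, seq_b)
    signal_term = level_loss(sig_a, sig_b)
    return sequence_term + signal_weight * signal_term
```

Because `level_loss` is passed in as a callable, a non-contrastive objective in the style of VICReg could be substituted without changing `two_level_loss`, which mirrors the loss-agnostic design described above.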