Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main challenges from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their ability to model fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. To address these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, built on two key designs: (1) Spatial-temporal masked modeling captures fine-grained temporal dynamics and inter-lead spatial dependencies by applying masks across both the spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy mitigates unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.
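To make the first design concrete, the spatial-temporal masking described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the mask ratios, patch length, and zero-fill strategy are assumptions chosen for demonstration, and a real model would reconstruct the masked regions with a learned decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatiotemporal_mask(ecg, lead_mask_ratio=0.25, time_mask_ratio=0.4, patch_len=50):
    """Mask an ECG along both the lead (spatial) and time axes.

    ecg: array of shape (num_leads, num_samples).
    Returns the masked signal and a boolean mask (True = masked out).
    Ratios and patch length are illustrative assumptions, not the paper's values.
    """
    num_leads, num_samples = ecg.shape
    mask = np.zeros_like(ecg, dtype=bool)

    # Spatial masking: drop a subset of whole leads so the model must
    # infer them from inter-lead dependencies.
    n_lead = max(1, int(lead_mask_ratio * num_leads))
    masked_leads = rng.choice(num_leads, size=n_lead, replace=False)
    mask[masked_leads, :] = True

    # Temporal masking: drop random contiguous patches in every lead so the
    # model must infer them from temporal context.
    num_patches = num_samples // patch_len
    n_time = max(1, int(time_mask_ratio * num_patches))
    for lead in range(num_leads):
        patches = rng.choice(num_patches, size=n_time, replace=False)
        for p in patches:
            mask[lead, p * patch_len:(p + 1) * patch_len] = True

    masked = np.where(mask, 0.0, ecg)  # zero-fill stands in for mask tokens
    return masked, mask

# Example: a synthetic 12-lead record, 1 s at 500 Hz.
ecg = rng.standard_normal((12, 500))
masked, mask = spatiotemporal_mask(ecg)
```

The reconstruction objective would then be a regression loss (e.g. MSE) between the decoder's output and the original signal, evaluated only at the masked positions.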
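The second design, disentanglement with modality-shared and modality-specific encoders, can likewise be sketched in miniature. This is a hedged illustration under simplifying assumptions: linear projections stand in for the learned encoders, the encoder names are hypothetical, and only the shared branches are aligned with a standard InfoNCE contrastive loss, so that modality-specific noise is kept out of the alignment objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(dim_in, dim_out):
    # A random linear projection stands in for a learned encoder network.
    W = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)
    return lambda x: x @ W

d_ecg, d_txt, d = 64, 32, 16
shared_ecg = encoder(d_ecg, d)  # modality-shared branch (ECG side)
shared_txt = encoder(d_txt, d)  # modality-shared branch (report side)
spec_ecg = encoder(d_ecg, d)    # modality-specific branch (ECG only)
spec_txt = encoder(d_txt, d)    # modality-specific branch (report only)

def info_nce(z_a, z_b, tau=0.07):
    """InfoNCE loss aligning paired rows of z_a and z_b.

    Applied only to the shared representations, so modality-specific
    content is excluded from cross-modal alignment.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                      # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # matched pairs on the diagonal

# A batch of 8 paired ECG / report feature vectors (synthetic).
ecg_feat = rng.standard_normal((8, d_ecg))
txt_feat = rng.standard_normal((8, d_txt))
align_loss = info_nce(shared_ecg(ecg_feat), shared_txt(txt_feat))
```

In a full framework, the specific branches would feed the generative (reconstruction) objective, and an additional term would encourage the shared and specific representations of each modality to remain distinct.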