Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

Self-supervised learning (SSL) has achieved major advances in natural image and video understanding, but challenges remain in domains such as echocardiography (heart ultrasound). These challenges stem from subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches, including contrastive, masked-modeling, and clustering-based methods, struggle with high inter-sample similarity, with sensitivity to the low-PSNR inputs common in ultrasound, or with aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual-branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, yielding temporally coherent representations enriched with fine-grained semantic understanding. We evaluate DISCOVR on six echocardiography datasets spanning fetal, pediatric, and adult populations. In zero-shot and linear-probing setups, it outperforms both specialized video anomaly detection methods and state-of-the-art video SSL baselines. It also achieves superior segmentation transfer and strong downstream performance on clinically relevant tasks such as left ventricular ejection fraction (LVEF) prediction. Code is available at: https://github.com/mdivyanshu97/DISCOVR
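The abstract does not give the exact form of the semantic cluster distillation loss. The sketch below is a minimal illustration that assumes a DINO/SwAV-style formulation. Image-branch (teacher) and video-branch (student) embeddings are scored against a shared set of learnable cluster prototypes by cosine similarity. The teacher's sharpened soft cluster assignment serves as a fixed target, and the loss is the cross-entropy between that target and the student's predicted assignment. All names here (`cluster_distillation_loss`, `prototypes`, `teacher_temp`, `student_temp`) and the temperature values are hypothetical. This is not the DISCOVR implementation.

```python
import math
from typing import List, Sequence

# Hypothetical sketch of a cluster distillation loss (teacher: image branch,
# student: video branch). Not the DISCOVR code; formulation is an assumption.

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def prototype_scores(emb: Vector, prototypes: Sequence[Vector]) -> List[float]:
    """Cosine similarity between one embedding and each cluster prototype."""
    e = l2_normalize(emb)
    return [sum(a * b for a, b in zip(e, l2_normalize(p))) for p in prototypes]


def cluster_distillation_loss(
    image_embs: Sequence[Vector],
    video_embs: Sequence[Vector],
    prototypes: Sequence[Vector],
    teacher_temp: float = 0.05,  # assumed: sharper teacher assignments
    student_temp: float = 0.1,   # assumed: softer student predictions
) -> float:
    """Mean cross-entropy between teacher (image) and student (video) cluster
    assignments. In a real training loop the teacher target would be detached
    (stop-gradient); plain Python has no autograd, so that is implicit here."""
    if len(image_embs) != len(video_embs):
        raise ValueError("image and video batches must have the same size")
    total = 0.0
    for img, vid in zip(image_embs, video_embs):
        target = softmax([s / teacher_temp for s in prototype_scores(img, prototypes)])
        log_pred = log_softmax([s / student_temp for s in prototype_scores(vid, prototypes)])
        total -= sum(t * lp for t, lp in zip(target, log_pred))
    return total / len(image_embs)


# Toy usage: two prototypes; the image embedding falls in cluster 0.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
image_batch = [[1.0, 0.0]]
aligned = cluster_distillation_loss(image_batch, [[1.0, 0.0]], prototypes)
misaligned = cluster_distillation_loss(image_batch, [[0.0, 1.0]], prototypes)
print(aligned, misaligned)  # aligned is far smaller than misaligned
```

Under this assumed formulation, the video branch is pushed to place each clip in the same cluster as the image branch's assignment. This illustrates how cluster-level (rather than instance-level) targets could carry spatial semantics from an image encoder into a video encoder.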

翻译：自监督学习（SSL）在自然图像与视频理解领域已取得重大进展，但在超声心动图（心脏超声）等医学影像领域仍面临挑战，这主要源于其细微的解剖结构、复杂的时序动态特性以及当前缺乏领域专用预训练模型。现有SSL方法（如对比学习、掩码建模及基于聚类的方法）在处理样本间高度相似性、对超声图像中常见的低峰值信噪比输入的敏感性，或可能扭曲临床相关特征的激进数据增强时存在局限。本文提出DISCOVR（跨模态视频表示的蒸馏图像监督），一种用于心脏超声视频表示学习的自监督双分支框架。DISCOVR结合了基于聚类的视频编码器（建模时序动态）与在线图像编码器（提取细粒度空间语义），通过语义聚类蒸馏损失将解剖知识从持续演进的图像编码器迁移至视频编码器，从而获得具有时序一致性且富含细粒度语义理解的视频表示。在涵盖胎儿、儿童及成人群体的六个超声心动图数据集上的评估表明，在零样本与线性探测设置下，DISCOVR在视频异常检测专用方法与前沿视频SSL基线模型中均表现更优，在左心室射血分数预测等临床相关任务中实现了卓越的分割迁移能力与强大的下游性能。代码发布于：https://github.com/mdivyanshu97/DISCOVR