SSiT: Saliency-guided Self-supervised Image Transformer for Diabetic Retinopathy Grading

Self-supervised learning (SSL) has been widely applied to learn image representations from unlabeled images, but it has not been fully explored in medical image analysis. In this work, we propose the Saliency-guided Self-Supervised image Transformer (SSiT) for diabetic retinopathy (DR) grading from fundus images. We introduce saliency maps into SSL to guide self-supervised pre-training with domain-specific prior knowledge. Specifically, SSiT employs two saliency-guided learning tasks: (1) Saliency-guided contrastive learning is built on momentum contrast, wherein the saliency maps of fundus images are used to remove trivial patches from the input sequences of the momentum-updated key encoder. The key encoder is thereby constrained to provide target representations that focus on salient regions, guiding the query encoder to capture salient features. (2) The query encoder is trained to predict the saliency segmentation, which encourages the learned representations to preserve fine-grained information. We evaluate the proposed method on four publicly accessible fundus image datasets: one is used for pre-training, and the other three are used to evaluate the pre-trained models on downstream DR grading. SSiT significantly outperforms other representative state-of-the-art SSL methods on all downstream datasets and under various evaluation settings. For example, SSiT achieves a Kappa score of 81.88% on the DDR dataset under fine-tuning evaluation, outperforming all other ViT-based SSL methods by at least 9.48%.
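To make the first task more concrete, the following is a minimal sketch of how a saliency map might be used to drop trivial patches from the key encoder's input sequence, together with the standard momentum (exponential moving average) update used in momentum contrast. The abstract does not specify the selection rule, so the mean-pooled patch score, the top-k selection, the `keep_ratio` parameter, and all function names here are illustrative assumptions and not the authors' implementation.

```python
from typing import List


def patch_saliency(saliency: List[List[float]], patch: int) -> List[float]:
    """Mean saliency of each non-overlapping patch, in row-major patch order.

    Assumption: a patch is scored by the average of its saliency values.
    """
    height, width = len(saliency), len(saliency[0])
    scores = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            total = sum(
                saliency[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            scores.append(total / (patch * patch))
    return scores


def select_salient_patches(scores: List[float], keep_ratio: float) -> List[int]:
    """Indices of the most salient patches, returned in sequence order.

    Assumption: the top `keep_ratio` fraction is kept (hypothetical rule);
    the remaining patches are treated as trivial and removed.
    """
    n_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def momentum_update(key_params: List[float], query_params: List[float],
                    m: float = 0.99) -> List[float]:
    """Momentum-contrast style update: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Toy 4x4 saliency map split into four 2x2 patches.
toy_map = [
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.6, 0.0, 0.1],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
toy_scores = patch_saliency(toy_map, patch=2)
kept = select_salient_patches(toy_scores, keep_ratio=0.5)
```

In this toy example only the kept patch indices would be fed to the key encoder, while the query encoder would still see the full patch sequence. In a real ViT pipeline the same indexing would be applied to the patch-token tensor before the key encoder's transformer blocks.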
