Self-supervised learning improves robustness of deep learning lung tumor segmentation to CT imaging differences

Self-supervised learning (SSL) extracts useful feature representations from unlabeled data and enables fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated task dataset both to pretrain the networks and to fine-tune them. The availability of large, diverse, and uncurated public medical image sets provides the opportunity to apply SSL in the "wild" and potentially extract features that are robust to imaging variations. However, the benefit of wild- versus self-pretraining has not been studied for medical image analysis. In this paper, we compare the robustness of wild- versus self-pretrained transformer models (vision transformer [ViT] and hierarchical shifted window [Swin]) to computed tomography (CT) imaging differences for non-small cell lung cancer (NSCLC) segmentation. Wild-pretrained Swin models outperformed self-pretrained Swin models across the analyzed imaging acquisitions. ViT achieved similar accuracy with wild- and self-pretraining. A masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than a contrastive task, which models global image information. After fine-tuning, wild-pretrained models showed higher feature reuse in the lower-level layers and greater feature differentiation close to the output layer. Hence, we conclude that wild-pretrained networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained networks, and that the Swin architecture benefited from such pretraining more than ViT.
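The two pretext tasks contrasted above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the mean-squared-error reconstruction loss, the InfoNCE-style contrastive loss, and the temperature value are all assumptions chosen for illustration. The sketch only shows that masked image prediction scores reconstruction of hidden patches (local structure), whereas a contrastive task compares whole-image embeddings (global information).

```python
import math
import random


def mask_patches(num_patches, mask_ratio, rng):
    """Choose which patch indices to hide (hypothetical helper)."""
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_prediction_loss(patches, predicted, masked_idx):
    """Mean squared error over the masked patches only.

    The loss ignores visible patches, so the network is rewarded only for
    reconstructing hidden local content from its surroundings.
    """
    total, count = 0.0, 0
    for i in masked_idx:
        for target, pred in zip(patches[i], predicted[i]):
            total += (target - pred) ** 2
            count += 1
    return total / count if count else 0.0


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss over whole-image embeddings (assumed form).

    view_a[i] and view_b[i] are embeddings of two augmentations of image i;
    every other pairing in the batch is treated as a negative.
    """
    n = len(view_a)
    loss = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss += log_denominator - logits[i]
    return loss / n
```

In this sketch a perfect reconstruction gives a masked prediction loss of zero, and the contrastive loss is small when matching views have similar embeddings and large when they do not.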
