Pretraining CNN models (e.g., UNet) through self-supervision has become a powerful approach to facilitate medical image segmentation in low-annotation regimes. Recent contrastive learning methods encourage similar global representations when the same image undergoes different transformations, or enforce invariance across different image/patch features that are intrinsically correlated. However, CNN-extracted global and local features are limited in capturing the long-range spatial dependencies that are essential in biological anatomy. To this end, we present a keypoint-augmented fusion layer that extracts representations preserving both short- and long-range self-attention. In particular, we augment the CNN feature map at multiple scales by incorporating an additional input that learns long-range spatial self-attention among localized keypoint features. Further, we introduce both global and local self-supervised pretraining for the framework. At the global scale, we obtain global representations both from the bottleneck of the UNet and by aggregating multiscale keypoint features. These global features are subsequently regularized through image-level contrastive objectives. At the local scale, we define a distance-based criterion to first establish correspondences among keypoints and then encourage similarity between the features of corresponding keypoints. Through extensive experiments on both MRI and CT segmentation tasks, we demonstrate the architectural advantages of our proposed method over both CNN and Transformer-based UNets when all architectures are trained from randomly initialized weights. With our proposed pretraining strategy, our method further outperforms existing self-supervised learning (SSL) methods by producing more robust self-attention and achieving state-of-the-art segmentation results. The code is available at https://github.com/zshyang/kaf.git.
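The local-scale objective described above (establish keypoint correspondences by a distance-based criterion, then encourage similarity between the matched features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `match_keypoints` and `local_similarity_loss`, the nearest-neighbour matching rule, the `max_dist` threshold, and the use of cosine similarity are all assumptions made for illustration, and keypoint coordinates from the two views are assumed to be already expressed in a shared reference frame.

```python
import math


def match_keypoints(coords_a, coords_b, max_dist):
    """Pair each keypoint in view A with its nearest keypoint in view B.

    Hypothetical criterion: a pair is kept only if the Euclidean distance
    between the two keypoints is at most max_dist.
    """
    pairs = []
    for i, pa in enumerate(coords_a):
        best_j, best_d = None, float("inf")
        for j, pb in enumerate(coords_b):
            d = math.dist(pa, pb)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d <= max_dist:
            pairs.append((i, best_j))
    return pairs


def cosine_similarity(u, v):
    """Cosine similarity of two feature vectors (small epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-12)


def local_similarity_loss(feats_a, feats_b, pairs):
    """Mean (1 - cosine similarity) over matched keypoint pairs.

    Assumed loss form; returns 0.0 when no correspondences were found.
    """
    if not pairs:
        return 0.0
    total = sum(1.0 - cosine_similarity(feats_a[i], feats_b[j]) for i, j in pairs)
    return total / len(pairs)


# Toy example: two keypoints per view, 2-D coordinates and 2-D features.
coords_a = [(0.0, 0.0), (10.0, 10.0)]
coords_b = [(0.5, 0.0), (30.0, 30.0)]
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[1.0, 0.0], [1.0, 0.0]]

pairs = match_keypoints(coords_a, coords_b, max_dist=2.0)
loss = local_similarity_loss(feats_a, feats_b, pairs)
```

In this toy example only the first keypoint of view A has a neighbour in view B within the threshold, so the second keypoint contributes nothing to the loss. A contrastive variant that also pushes apart non-corresponding keypoints would be an equally plausible reading of the text.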