This paper presents a self-supervised framework for learning voxel-wise coarse-to-fine representations tailored to dense downstream tasks. Our approach stems from the observation that existing methods for hierarchical representation learning tend to prioritize global features over local features because of an inherent architectural bias. To address this, we devise a training strategy that balances the contributions of features from multiple scales, so that the learned representations capture both coarse and fine-grained details. The strategy comprises three improvements: (1) local data augmentations, (2) a hierarchically balanced architecture, and (3) a hybrid contrastive-restorative loss function. We evaluate our method on CT and MRI data and show that it is particularly beneficial for fine-tuning with limited annotated data and that it consistently outperforms its baseline counterpart in linear evaluation settings.
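To make the third component concrete, a hybrid contrastive-restorative objective could look roughly like the sketch below. It is an illustration under stated assumptions, not the formulation used in this work: the abstract does not specify the loss, so the InfoNCE contrastive term, the mean-squared-error restorative term, the equal per-scale weights, and every name (`info_nce`, `mse`, `hybrid_loss`, `scale_weights`, `lam`) are hypothetical. The code uses only the Python standard library so that it runs as written; a practical implementation would use a tensor library.

```python
import math


def mse(pred, target):
    """Restorative term (assumed): mean squared error between restored and original voxels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def _cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Contrastive term (assumed): InfoNCE over a batch of embeddings.

    positives[i] is the match for anchors[i]; every other positives[j]
    serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [_cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def hybrid_loss(per_scale, restored, original, scale_weights=None, lam=1.0):
    """Weighted sum of per-scale contrastive terms plus a restorative term.

    per_scale: list of (anchors, positives) pairs, one pair per feature scale.
    scale_weights: hypothetical per-scale weights; equal weights by default,
        so that no single scale dominates the contrastive term.
    lam: hypothetical weight of the restorative term.
    """
    if scale_weights is None:
        scale_weights = [1.0 / len(per_scale)] * len(per_scale)
    contrastive = sum(
        w * info_nce(anchors, positives)
        for w, (anchors, positives) in zip(scale_weights, per_scale)
    )
    return contrastive + lam * mse(restored, original)
```

With equal `scale_weights`, the coarse and fine scales contribute equally to the contrastive term, which is one simple way to express the balancing idea described above. The balance actually used, and the relative weight of the two loss terms, would have to be taken from the full paper.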