This paper introduces vox2vec, a contrastive method for self-supervised learning (SSL) of voxel-level representations. vox2vec representations are modeled by a Feature Pyramid Network (FPN): a voxel representation is a concatenation of the corresponding feature vectors from different pyramid levels. The FPN is pre-trained to produce similar representations for the same voxel in different augmented contexts and distinct representations for different voxels. This results in unified multi-scale representations that capture both global semantics (e.g., body part) and local semantics (e.g., different small organs or healthy versus tumor tissue). We use vox2vec to pre-train an FPN on more than 6500 publicly available computed tomography images. We evaluate the pre-trained representations by attaching simple heads on top of them and training the resulting models on 22 segmentation tasks. We show that vox2vec outperforms existing medical imaging SSL techniques in three evaluation setups: linear probing, non-linear probing, and end-to-end fine-tuning. Moreover, a non-linear head trained on top of the frozen vox2vec representations performs competitively with the FPN trained from scratch while having 50 times fewer trainable parameters. The code is available at https://github.com/mishgon/vox2vec.
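The two ideas in the abstract, concatenating per-level features into one voxel representation and pulling together the same voxel across augmented views, can be illustrated with a minimal sketch in plain Python. It is not taken from the vox2vec repository. The function names, the stride of 2 between pyramid levels, the InfoNCE form of the contrastive loss, and the temperature value are all assumptions made for illustration; the abstract only states that the method is contrastive.

```python
import math


def voxel_representation(pyramid, voxel):
    """Concatenate the feature vectors that cover `voxel` at every pyramid level.

    `pyramid[k]` is a dict mapping a coarse coordinate to a feature vector.
    Assumption: level k has stride 2**k, so the covering cell is voxel // 2**k.
    """
    representation = []
    for level, feature_map in enumerate(pyramid):
        stride = 2 ** level
        coarse = tuple(c // stride for c in voxel)
        representation.extend(feature_map[coarse])
    return representation


def normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss (assumed form): row i of view_a is the positive of row i
    of view_b, and every other row of view_b acts as a negative."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy two-level pyramid: level 0 at full resolution, level 1 at stride 2.
pyramid = [
    {(0, 0, 0): [1.0, 0.0], (1, 0, 0): [0.0, 1.0]},
    {(0, 0, 0): [0.5]},
]
representation = voxel_representation(pyramid, (1, 0, 0))

# Representations of the same two voxels in two views, aligned vs. swapped.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is low when each voxel matches its own counterpart in the other view and high when the pairing is swapped, which is the behavior the pre-training objective is described as encouraging. A real implementation would operate on dense 3D feature maps produced by the network, not on dictionaries.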