A Study on Self-Supervised Pretraining for Vision Problems in Gastrointestinal Endoscopy

Solutions to vision tasks in gastrointestinal endoscopy (GIE) conventionally use image encoders pretrained in a supervised manner with ImageNet-1k as backbones. However, the use of modern self-supervised pretraining algorithms and a recent dataset of 100k unlabelled GIE images (Hyperkvasir-unlabelled) may allow for improvements. In this work, we study the fine-tuned performance of models with ResNet50 and ViT-B backbones, pretrained in a self-supervised or supervised manner with ImageNet-1k, or in a self-supervised manner with Hyperkvasir-unlabelled, in a range of GIE vision tasks. In addition to identifying the most suitable pretraining pipeline and backbone architecture for each task, out of those considered, our results suggest three general principles. Firstly, self-supervised pretraining generally produces more suitable backbones for GIE vision tasks than supervised pretraining. Secondly, self-supervised pretraining with ImageNet-1k is typically more suitable than pretraining with Hyperkvasir-unlabelled, with the notable exception of monocular depth estimation in colonoscopy. Thirdly, ViT-Bs are more suitable in polyp segmentation and monocular depth estimation in colonoscopy, ResNet50s are more suitable in polyp detection, and both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation. We hope this work draws attention to the complexity of pretraining for GIE vision tasks, informs the development of more suitable approaches than the convention, and inspires further research on this topic to help advance this development. Code available: github.com/ESandML/SSL4GIE
