Solutions to vision tasks in gastrointestinal endoscopy (GIE) conventionally use image encoders pretrained in a supervised manner with ImageNet-1k as backbones. However, modern self-supervised pretraining algorithms and a recent dataset of 100k unlabelled GIE images (Hyperkvasir-unlabelled) may allow for improvements. In this work, we study the fine-tuned performance of models with ResNet50 and ViT-B backbones in a range of GIE vision tasks, where the backbones are pretrained in a self-supervised or supervised manner with ImageNet-1k, or in a self-supervised manner with Hyperkvasir-unlabelled. In addition to identifying the most suitable pretraining pipeline and backbone architecture for each task, out of those considered, our results suggest: that self-supervised pretraining generally produces more suitable backbones for GIE vision tasks than supervised pretraining; that self-supervised pretraining with ImageNet-1k is typically more suitable than pretraining with Hyperkvasir-unlabelled, with the notable exception of monocular depth estimation in colonoscopy; and that ViT-Bs are more suitable in polyp segmentation and monocular depth estimation in colonoscopy, ResNet50s are more suitable in polyp detection, and both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation. We hope this work draws attention to the complexity of pretraining for GIE vision tasks, informs the development of approaches more suitable than the convention, and inspires further research on this topic to help advance that development. Code available: \underline{github.com/ESandML/SSL4GIE}