It has recently been demonstrated that pretraining backbones in a self-supervised manner generally provides better fine-tuned polyp segmentation performance, and that models with ViT-B backbones typically perform better than models with ResNet50 backbones. In this paper, we extend this work to consider generalisability: we assess the performance of models on a dataset different from the one used for fine-tuning, accounting for variation in network architecture and pretraining pipeline (algorithm and dataset). This reveals how well models with different pretrained backbones generalise to data drawn from a somewhat different distribution to the training data, a shift that is likely to arise in deployment due to differences in cameras and patient demographics, amongst other factors. We observe that the previous findings regarding pretraining pipelines for polyp segmentation hold true when considering generalisability. However, our results imply that models with ResNet50 backbones typically generalise better, despite being outperformed by models with ViT-B backbones when evaluated on the test set of the dataset used for fine-tuning.
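As an illustration of the evaluation protocol described above, the following is a minimal sketch of how a generalisation gap could be measured for one fine-tuned model. It assumes the Dice coefficient as the segmentation metric and represents masks as flat lists of 0/1 values. The function names `dice_score`, `mean_dice` and `generalisation_gap`, the choice of metric, and the toy masks are all illustrative assumptions, not the evaluation code or data used in this work.

```python
from statistics import mean


def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pairs):
    """Mean Dice over an iterable of (prediction, ground truth) pairs."""
    return mean(dice_score(pred, target) for pred, target in pairs)


def generalisation_gap(in_dist_pairs, out_dist_pairs):
    """Mean Dice on the fine-tuning dataset's test set minus mean Dice
    on a test set from a different dataset. A smaller gap suggests
    better generalisation."""
    return mean_dice(in_dist_pairs) - mean_dice(out_dist_pairs)


# Hypothetical toy masks, for illustration only.
in_dist = [([1, 1, 0, 0], [1, 1, 0, 0])]
out_dist = [([1, 1, 0, 0], [1, 0, 0, 0])]
gap = generalisation_gap(in_dist, out_dist)
```

Under this sketch, comparing the gap across backbones (for example ViT-B and ResNet50) and across pretraining pipelines would indicate which configurations lose the least performance under distribution shift.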