Towards Optimal Patch Size in Vision Transformers for Tumor Segmentation

Detection of tumors in metastatic colorectal cancer (mCRC) plays an essential role in the early diagnosis and treatment of liver cancer. Deep learning models built on fully convolutional neural network (FCNN) backbones have become the dominant approach for segmenting 3D computed tomography (CT) scans. However, because their convolution layers have limited kernel sizes, they cannot capture long-range dependencies and global context. Vision transformers were introduced to overcome this locality of FCNN receptive fields. Although transformers can capture long-range features, their segmentation performance degrades when tumor sizes vary, because the model is sensitive to the input patch size. Finding an optimal patch size improves the performance of vision transformer-based models on segmentation tasks, but it is a time-consuming and challenging procedure. This paper proposes a technique to select the vision transformer's optimal input multi-resolution image patch size based on the average volume of metastatic lesions. We further validate the proposed framework using transfer learning, demonstrating that the highest Dice similarity coefficient (DSC) is obtained by pre-training on data with a larger tumor volume using the suggested optimal patch size and then training on data with a smaller tumor volume. We evaluate this idea experimentally by pre-training our model on a multi-resolution public dataset. The model shows consistent and improved results when applied to our private multi-resolution mCRC dataset, which has a smaller average tumor volume. This study lays the groundwork for optimizing semantic segmentation of small objects using vision transformers. The implementation source code is available at: https://github.com/Ramtin-Mojtahedi/OVTPS.
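
For readers unfamiliar with the evaluation metric, the DSC between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that standard definition for binary 3D masks stored as nested lists; it is not taken from the authors' repository. The function names `flatten` and `dice_similarity` are hypothetical, and returning 1.0 when both masks are empty is an assumed convention.

```python
from itertools import chain


def flatten(volume):
    """Flatten a 3D nested list (depth x height x width) into a flat list of voxels."""
    return list(chain.from_iterable(chain.from_iterable(volume)))


def dice_similarity(pred, target):
    """Dice similarity coefficient between two binary 3D masks of equal shape."""
    p = [bool(v) for v in flatten(pred)]
    t = [bool(v) for v in flatten(target)]
    if len(p) != len(t):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled as tumor in both masks.
    intersection = sum(a and b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Toy 2x2x2 volumes: the prediction marks 3 voxels, the reference marks 2,
# and both reference voxels are also marked in the prediction.
pred = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
target = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
score = dice_similarity(pred, target)  # 2 * 2 / (3 + 2) = 0.8
```

Because the denominator sums the sizes of both masks, a single misclassified voxel changes the score far more for a small lesion than for a large one, which is one reason segmentation of small tumors is reported separately in studies like this one.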
