Semantic segmentation has a broad range of applications across domains including land cover analysis, autonomous driving, and medical image analysis. Convolutional neural networks (CNNs) and Vision Transformers (ViTs) provide the main architectural models for semantic segmentation. Although ViTs have proven successful in image classification, they cannot be applied directly to dense prediction tasks such as image segmentation and object detection, because ViT's patch partitioning scheme prevents it from serving as a general-purpose backbone. In this survey, we discuss several ViT architectures that can be used for semantic segmentation and how their evolution addressed this challenge. The rise of ViTs and their strong performance have motivated the community to gradually replace traditional convolutional neural networks in various computer vision tasks. This survey reviews and compares the performance of ViT architectures designed for semantic segmentation on benchmark datasets. It is intended to help the community understand the approaches taken so far in semantic segmentation and to discover more efficient methodologies using ViTs.