Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers

In the wake of Masked Image Modeling (MIM), a diverse range of plain, non-hierarchical Vision Transformer (ViT) models has been pre-trained on extensive datasets, offering new paradigms and significant potential for semantic segmentation. Current state-of-the-art systems incorporate numerous inductive biases and employ cumbersome decoders. Building upon the original motivations of plain ViTs, namely simplicity and generality, we explore high-performance `minimalist' systems for this task. Our primary purpose is to provide simple and efficient baselines for practical semantic segmentation with plain ViTs. Specifically, we first explore whether, and how, high-performance semantic segmentation can be achieved using the last feature map. As a result, we introduce PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers (either encoder or decoder). In this process, we offer insights into two underlying principles: (i) high-resolution features are crucial to high performance, even when simple up-sampling techniques are used, and (ii) a slim transformer decoder requires a much larger learning rate than a wide transformer decoder. On this basis, we further present PlainSeg-Hier, which allows for the utilization of hierarchical features. Extensive experiments on four popular benchmarks demonstrate the high performance and efficiency of our methods. They can also serve as powerful tools for assessing the transfer ability of base models in semantic segmentation. Code is available at \url{https://github.com/ydhongHIT/PlainSeg}.
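The two principles stated in the abstract can be illustrated with a small sketch. It is not taken from the PlainSeg code base: the function names, the `decoder.` parameter prefix, the patch size, the up-sampling factor, and the learning-rate multiplier are all hypothetical values chosen for illustration. The first helper shows the resolution arithmetic behind principle (i); the second shows one way to give a decoder a larger learning rate than the backbone, as principle (ii) suggests, by returning a list of dicts with `params` and `lr` keys, the shape that PyTorch optimizers accept as per-parameter options.

```python
def feature_map_size(image_size, patch_size, upsample_factor=1):
    """Side length of the last ViT feature map, optionally up-sampled.

    A plain ViT splits the image into non-overlapping patches, so the last
    feature map has image_size // patch_size tokens per side.
    """
    tokens_per_side = image_size // patch_size
    return tokens_per_side * upsample_factor


def build_param_groups(named_params, base_lr, decoder_lr_mult,
                       decoder_prefix="decoder."):
    """Split parameters into a backbone group and a decoder group.

    decoder_prefix and decoder_lr_mult are assumptions for this sketch;
    the abstract only states that a slim decoder needs a much larger
    learning rate, not how much larger.
    """
    backbone, decoder = [], []
    for name, param in named_params:
        if name.startswith(decoder_prefix):
            decoder.append(param)
        else:
            backbone.append(param)
    return [
        {"params": backbone, "lr": base_lr},
        {"params": decoder, "lr": base_lr * decoder_lr_mult},
    ]


if __name__ == "__main__":
    # Hypothetical setting: 512-pixel input, 16-pixel patches, 4x up-sampling.
    print(feature_map_size(512, 16))      # 32 tokens per side
    print(feature_map_size(512, 16, 4))   # 128 after 4x up-sampling

    # Placeholder strings stand in for real parameter tensors.
    params = [("backbone.blocks.0.weight", "w0"),
              ("decoder.layers.0.weight", "w1")]
    for group in build_param_groups(params, base_lr=1e-4, decoder_lr_mult=10.0):
        print(group)
```

With a real model, `named_params` would typically come from something like `model.named_parameters()`, and the appropriate multiplier would have to be found empirically for the decoder width in use.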
