In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Motivated by the idea of leveraging stronger pre-trained models and fewer trainable parameters for superior generalizability, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines the feature maps from each layer and forwards them to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full-parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves an mIoU of 68.1% on Cityscapes without accessing any real urban-scene datasets. Code is available at https://github.com/w1oves/Rein.git.
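To make the token-based refinement concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that each frozen layer has its own small set of trainable tokens, that each patch feature attends to those tokens through a scaled dot-product softmax, and that the resulting weighted token sum is added back to the feature before it is passed to the next layer. The names `refine_layer` and `forward` are hypothetical, and the actual method may include learned projections or other components that the abstract does not describe.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_layer(features, tokens):
    """Refine one layer's features with that layer's trainable tokens.

    features: list of N patch vectors, each of dimension C (frozen backbone output)
    tokens:   list of M token vectors, each of dimension C (the only trainable part)
    Assumption: the refinement is a residual update; this is an illustrative choice.
    """
    dim = len(tokens[0])
    refined = []
    for feat in features:
        # Similarity between this patch and every token, scaled by sqrt(C).
        scores = [sum(a * b for a, b in zip(feat, tok)) / math.sqrt(dim)
                  for tok in tokens]
        weights = softmax(scores)
        # Weighted sum of tokens gives a patch-specific adjustment.
        delta = [sum(w * tok[c] for w, tok in zip(weights, tokens))
                 for c in range(dim)]
        refined.append([x + d for x, d in zip(feat, delta)])
    return refined


def forward(features, frozen_layers, tokens_per_layer):
    """Run the frozen layers, refining the features after each one."""
    for layer, tokens in zip(frozen_layers, tokens_per_layer):
        features = layer(features)                 # frozen: no parameters updated
        features = refine_layer(features, tokens)  # trainable tokens only
    return features
```

Because different patches produce different similarity weights, patches belonging to different categories receive different adjustments within the same image, which is the behavior the abstract describes. Only the token vectors would be trained in this sketch; the layers passed as `frozen_layers` stay fixed.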