This paper presents a new framework for open-vocabulary semantic segmentation with a pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network with two branches is attached to a frozen CLIP model: one branch predicts mask proposals, and the other predicts attention biases that are applied in the CLIP model to recognize the class of each mask. This decoupled design benefits CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to adapt to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast and accurate, and adds only a few trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation. The code will be available at https://github.com/MendelXu/SAN.
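To make the role of the attention bias concrete, the following is a minimal sketch, not the SAN implementation. It assumes a single query token attending over patch tokens, toy two-dimensional features, and class recognition by cosine similarity against text embeddings. All function names (`biased_attention_pool`, `classify_masks`) and all values are hypothetical and chosen only for illustration. Each mask proposal contributes one bias vector that is added to the attention logits, so the pooled feature is drawn mainly from the patches inside that mask.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_pool(query, keys, values, bias):
    """Attend from one query to all patches, adding a per-patch bias to the logits."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale + b for k, b in zip(keys, bias)]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def classify_masks(query, keys, values, mask_biases, text_embeddings):
    """Pool features under each mask's bias and return the best-matching class name."""
    labels = []
    for bias in mask_biases:
        pooled = biased_attention_pool(query, keys, values, bias)
        scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels


# Toy setup (assumed values): four patches, the first two resemble "cat",
# the last two resemble "grass".
NEG = -1e9  # large negative bias: effectively removes a patch from attention
query = [0.0, 0.0]
keys = [[0.0, 0.0]] * 4
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask_biases = [
    [0.0, 0.0, NEG, NEG],  # proposal covering patches 0 and 1
    [NEG, NEG, 0.0, 0.0],  # proposal covering patches 2 and 3
]
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

labels = classify_masks(query, keys, values, mask_biases, text_embeddings)
```

In this sketch the recognition weights (standing in for the frozen CLIP model) are never modified; only the bias vectors change per proposal, which is the sense in which mask prediction and mask recognition are decoupled.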