In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from the text embeddings of vision-language models. We employ these text embeddings as object queries within a transformer-based segmentation framework (textual object queries). Such queries serve as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. In addition, we suggest three regularization losses to improve the efficacy of tqdm by aligning visual and textual features. With our method, the model can comprehend inherent semantic information for classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5$\rightarrow$Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.
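To illustrate the core idea of textual object queries, the following is a minimal NumPy sketch of pixel grouping via cross-attention between per-class text embeddings (queries) and dense visual features. This is an illustrative toy, not the paper's tqdm implementation: the function name, shapes, and the argmax-based grouping rule are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def textual_query_grouping(text_emb, pixel_feat):
    """Toy pixel grouping with textual object queries.

    text_emb:   (K, D) text embeddings, one query per class
    pixel_feat: (N, D) dense visual features, one row per pixel
    Returns a per-pixel class assignment of shape (N,).
    """
    d = text_emb.shape[-1]
    # scaled dot-product attention scores between queries and pixels
    attn = softmax(text_emb @ pixel_feat.T / np.sqrt(d), axis=0)  # (K, N)
    # each pixel is grouped under its most similar textual query
    return attn.argmax(axis=0)

# toy example: 3 orthogonal class embeddings, 4 pixels
text_emb = np.eye(3, 8)                 # hypothetical class embeddings
pixel_feat = text_emb[[0, 1, 2, 1]]     # pixels aligned with classes 0,1,2,1
print(textual_query_grouping(text_emb, pixel_feat))  # [0 1 2 1]
```

Because the queries come from text rather than from image content, the grouping basis stays fixed across visual domains, which is the intuition behind using them for DGSS.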