LMSeg: Language-guided Multi-dataset Segmentation

Building a general and inclusive segmentation model that can recognize more categories across varied scenarios is a meaningful and attractive goal. A straightforward approach is to combine the existing fragmented segmentation datasets and train a single multi-dataset network. However, multi-dataset segmentation faces two major issues: (1) the datasets' inconsistent taxonomies must be reconciled manually to construct a unified taxonomy; (2) an inflexible one-hot common taxonomy requires time-consuming model retraining and provides defective supervision for unlabeled categories. In this paper, we investigate multi-dataset segmentation and propose a scalable Language-guided Multi-dataset Segmentation framework, dubbed LMSeg, which supports both semantic and panoptic segmentation. Specifically, we introduce a pre-trained text encoder that maps category names to a text embedding space, which serves as the unified taxonomy in place of inflexible one-hot labels. The model dynamically aligns the segment queries with the category embeddings. Instead of relabeling each dataset with the unified taxonomy, we design a category-guided decoding module that dynamically guides predictions to each dataset's taxonomy. Furthermore, we adopt a dataset-aware augmentation strategy that assigns each dataset its own image augmentation pipeline, suited to the properties of that dataset's images. Extensive experiments demonstrate that our method achieves significant improvements on four semantic and three panoptic segmentation datasets, and an ablation study evaluates the effectiveness of each component.
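The core idea of aligning segment queries with category embeddings, and restricting predictions to one dataset's taxonomy, can be illustrated with a minimal sketch. This is not the LMSeg implementation. The toy 3-dimensional vectors stand in for the output of a pre-trained text encoder, and the names `CATEGORY_EMBEDDINGS`, `DATASET_TAXONOMY`, `classify_queries`, `dataset_a`, and `dataset_b` are hypothetical, chosen only for illustration. The sketch assumes cosine similarity as the alignment score, which the abstract does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy embeddings; a real system would obtain these by passing
# each category name through a pre-trained text encoder.
CATEGORY_EMBEDDINGS = {
    "road": [1.0, 0.0, 0.0],
    "person": [0.0, 1.0, 0.0],
    "sky": [0.0, 0.0, 1.0],
    "car": [0.7, 0.7, 0.0],
}

# Each dataset keeps its own label set; no manual unified relabeling is done.
DATASET_TAXONOMY = {
    "dataset_a": ["road", "person", "car"],
    "dataset_b": ["person", "sky"],
}

def classify_queries(query_embeddings, dataset):
    """Assign each segment query the best-matching category name,
    considering only the categories of the given dataset."""
    names = DATASET_TAXONOMY[dataset]
    labels = []
    for q in query_embeddings:
        scores = [cosine(q, CATEGORY_EMBEDDINGS[n]) for n in names]
        labels.append(names[scores.index(max(scores))])
    return labels

# Two hypothetical segment-query embeddings.
queries = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9]]
print(classify_queries(queries, "dataset_a"))
print(classify_queries(queries, "dataset_b"))
```

Because categories live in a shared embedding space rather than a fixed one-hot vector, adding a dataset in this sketch only requires a new entry in `DATASET_TAXONOMY` and embeddings for any new category names. The same queries are scored against a different candidate set for each dataset.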
