In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first single model to handle all these tasks and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks while significantly reducing computational and parameter overhead across various tasks and datasets. We also rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.
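To make the architectural claim concrete, below is a hypothetical minimal sketch (not the authors' implementation) of the core idea behind such unified query-based segmenters: a single set of learned object queries cross-attends to features from a shared encoder, and every task reads its prediction (class logits and masks) from the same per-query outputs. All names (`UnifiedSegDecoder`, the head layers) are illustrative assumptions.

```python
# Hypothetical sketch of a shared query-based decoder in the spirit of
# unified segmentation models: one set of queries serves all tasks.
import torch
import torch.nn as nn

class UnifiedSegDecoder(nn.Module):
    def __init__(self, num_queries=100, dim=256, num_classes=80):
        super().__init__()
        # Shared learned object queries reused across tasks.
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for a "no object" class
        self.mask_embed = nn.Linear(dim, dim)             # projects queries into mask space

    def forward(self, feats):
        # feats: (B, H*W, dim) flattened pixel features from a shared encoder;
        # for video, frame features can be flattened the same way.
        B = feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, feats)                        # queries attend to pixel features
        logits = self.cls_head(q)                         # (B, Q, C+1) per-query class scores
        # Dot product between query embeddings and pixel features yields masks.
        masks = torch.einsum("bqd,bpd->bqp", self.mask_embed(q), feats)  # (B, Q, H*W)
        return logits, masks

feats = torch.randn(2, 16 * 16, 256)                      # dummy 16x16 feature map, batch of 2
logits, masks = UnifiedSegDecoder()(feats)
print(logits.shape, masks.shape)  # torch.Size([2, 100, 81]) torch.Size([2, 100, 256])
```

Under this shared-decoder view, task differences reduce to how queries are initialized (learned for panoptic/instance tasks, prompt-derived for interactive segmentation) and how the per-query outputs are post-processed.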