In this work, we address a range of segmentation tasks, each traditionally tackled by distinct or only partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all of these tasks: image semantic, instance, and panoptic segmentation, their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation in the style of SAM, and video object segmentation. To our knowledge, this is the first single model to handle all of these tasks and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks while significantly reducing computational and parameter overhead across tasks and datasets. We also rigorously evaluate inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.
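To make the "task-specific queries on a shared encoder-decoder" idea concrete, the following is a minimal, hypothetical PyTorch sketch. It is not the actual OMG-Seg implementation: the class name SharedSegDecoder, the prompt_proj box-prompt projection, and all hyperparameters are illustrative assumptions. It only shows the unification pattern the abstract describes, where one decoder serves both query-based (semantic/instance/panoptic) segmentation and SAM-style prompt-driven segmentation by treating visual prompts as extra queries.

```python
import torch
import torch.nn as nn

class SharedSegDecoder(nn.Module):
    """Hypothetical sketch: one decoder, task-specific queries.

    Learned object queries cover semantic/instance/panoptic prediction;
    interactive prompts (here, boxes) are projected into the same query
    space, so a single set of decoder weights serves all task modes.
    """
    def __init__(self, d_model=256, num_queries=100, num_classes=133):
        super().__init__()
        # Learned object queries shared across segmentation tasks.
        self.object_queries = nn.Embedding(num_queries, d_model)
        # Assumed prompt encoding: a box (x1, y1, x2, y2) becomes one query.
        self.prompt_proj = nn.Linear(4, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(d_model, num_classes + 1)  # +1 "no object"
        self.mask_embed = nn.Linear(d_model, d_model)

    def forward(self, feats, mask_feats, prompts=None):
        # feats:      (B, HW, C) flattened encoder features (decoder memory)
        # mask_feats: (B, C, H, W) per-pixel embeddings for dot-product masks
        # prompts:    optional (B, P, 4) box prompts for interactive mode
        B = feats.size(0)
        q = self.object_queries.weight.unsqueeze(0).expand(B, -1, -1)
        if prompts is not None:
            # Interactive mode: prompts become additional queries.
            q = torch.cat([q, self.prompt_proj(prompts)], dim=1)
        q = self.decoder(q, feats)
        logits = self.cls_head(q)                       # (B, Q, classes+1)
        masks = torch.einsum("bqc,bchw->bqhw",          # one mask per query
                             self.mask_embed(q), mask_feats)
        return logits, masks

# Toy usage: panoptic-style pass, then the same weights with a box prompt.
model = SharedSegDecoder()
feats = torch.randn(2, 64 * 64, 256)
mask_feats = torch.randn(2, 256, 64, 64)
logits, masks = model(feats, mask_feats)                       # query mode
logits_p, masks_p = model(feats, mask_feats,
                          prompts=torch.rand(2, 3, 4))         # prompt mode
```

The point of the sketch is the design choice it illustrates: the query is the unification interface. Because every task mode reduces to "a set of queries attends to shared features and emits a class logit and a mask," adding a task costs only a new query source, not a new decoder, which is consistent with the abstract's claim of reduced computational and parameter overhead.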