Panoptic segmentation assigns a semantic label and an instance ID to every pixel of an image. Because any permutation of the instance IDs is also a valid solution, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches rely on customized architectures and task-specific loss functions. We instead formulate panoptic segmentation as a discrete data generation problem, without relying on the inductive biases of the task. We propose a diffusion model for panoptic masks, with a simple architecture and a generic loss function. By simply adding past predictions as a conditioning signal, our method can model video in a streaming setting and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach performs competitively with state-of-the-art specialist methods in similar settings.
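To make the one-to-many nature of the task concrete, the following minimal sketch shows two instance ID maps that differ only by a permutation of IDs and therefore describe the same segmentation. The helper `canonicalize` and the toy masks are illustrative assumptions, not part of the proposed method.

```python
import numpy as np

def canonicalize(instance_ids: np.ndarray) -> np.ndarray:
    """Relabel instance IDs by order of first appearance in raster scan.

    Hypothetical helper for illustration only: two ID maps that differ
    by a permutation of IDs map to the same canonical array.
    """
    flat = instance_ids.ravel()
    _, first_idx, inverse = np.unique(flat, return_index=True, return_inverse=True)
    # Rank each unique ID by where it first occurs in the flattened mask.
    order = np.argsort(first_idx)
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse].reshape(instance_ids.shape)

# Toy 3x3 instance ID map with three instances.
mask_a = np.array([[1, 1, 2],
                   [1, 2, 2],
                   [3, 3, 2]])

# Same segmentation with IDs permuted: 1 -> 3, 2 -> 1, 3 -> 2.
mask_b = np.array([[3, 3, 1],
                   [3, 1, 1],
                   [2, 2, 1]])

# A genuinely different segmentation (one pixel changes instance).
mask_c = np.array([[1, 1, 1],
                   [1, 2, 2],
                   [3, 3, 2]])

same_ab = np.array_equal(canonicalize(mask_a), canonicalize(mask_b))
same_ac = np.array_equal(canonicalize(mask_a), canonicalize(mask_c))
```

Here `mask_a` and `mask_b` are unequal as arrays yet equivalent as segmentations, which is why a per-pixel loss on raw instance IDs is ill-posed without some form of matching or a generative formulation.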