Panoptic segmentation assigns a semantic label and an instance ID to every pixel of an image. Because any permutation of the instance IDs is also a valid solution, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on the inductive biases of the task. We propose a diffusion model for panoptic masks, with a simple architecture and a generic loss function. By simply adding past predictions as a conditioning signal, our method can model video in a streaming setting and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach performs competitively with state-of-the-art specialist methods in similar settings.
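To make the one-to-many nature of the task concrete, the following minimal sketch enumerates every relabeling of a toy instance-ID map and checks that all of them describe the same grouping of pixels. The mask values and the helper names `relabel` and `canonical` are hypothetical and chosen only for illustration; they are not part of the proposed method.

```python
from itertools import permutations

def relabel(mask, mapping):
    """Apply an instance-ID mapping to every pixel of a 2D mask."""
    return [[mapping[i] for i in row] for row in mask]

def canonical(mask):
    """Renumber instance IDs in order of first appearance (row-major)."""
    seen = {}
    return [[seen.setdefault(i, len(seen)) for i in row] for row in mask]

# Toy 2x4 instance-ID map with three instances (values are illustrative).
mask = [[0, 0, 1, 1],
        [2, 2, 1, 1]]

ids = sorted({i for row in mask for i in row})
variants = [relabel(mask, dict(zip(ids, p))) for p in permutations(ids)]

# The relabeled maps differ pixel-wise but share one canonical form.
distinct = {tuple(map(tuple, v)) for v in variants}
same_partition = all(canonical(v) == canonical(mask) for v in variants)
print(len(distinct), same_partition)  # → 6 True
```

With three instances there are already 3! = 6 equally valid targets for the same image, and the count grows factorially with the number of instances. A per-pixel loss against one fixed labeling would penalize the other five, which is one way to see why the mapping is one-to-many.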