Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to handle the permutation invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture that omits these complexities. Our training consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. This generative approach also enables mask completion and inpainting. Experimental validation on COCO and ADE20K yields strong segmentation results. Finally, we demonstrate our model's adaptability to multi-tasking by introducing learnable task embeddings.
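The two-stage recipe above can be sketched as a toy PyTorch example. Everything here is an illustrative assumption, not the paper's actual architecture: the class names, layer sizes, the linear noise schedule, and the pre-computed image features standing in for a conditioning encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAutoencoder(nn.Module):
    """Stage 1 (sketch): shallow autoencoder projecting one-hot masks to a latent map."""
    def __init__(self, num_classes=8, latent_dim=4):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(num_classes, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_dim, 3, stride=2, padding=1))
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1))
    def forward(self, m):
        z = self.enc(m)
        return self.dec(z), z

class ConditionalDenoiser(nn.Module):
    """Stage 2 (sketch): predicts the noise in a mask latent, conditioned on image features."""
    def __init__(self, latent_dim=4, cond_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim + cond_dim + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, padding=1))
    def forward(self, z_t, cond, t):
        # Broadcast the scalar timestep over the spatial grid as an extra channel.
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *z_t.shape[2:]).float()
        return self.net(torch.cat([z_t, cond, t_map], dim=1))

# Toy data: a batch of one-hot masks plus image features at latent resolution
# (a stand-in for a frozen image encoder, which is assumed here).
masks = F.one_hot(torch.randint(0, 8, (2, 32, 32)), 8).permute(0, 3, 1, 2).float()
img_feats = torch.randn(2, 4, 8, 8)

# Stage 1: train the autoencoder with a plain reconstruction loss.
ae = MaskAutoencoder()
recon, _ = ae(masks)
loss_ae = F.cross_entropy(recon, masks.argmax(1))

# Stage 2: freeze the autoencoder, diffuse the mask latents, regress the noise.
with torch.no_grad():
    z0 = ae.enc(masks)
t = torch.rand(2)                   # continuous timestep in [0, 1)
noise = torch.randn_like(z0)
alpha = (1 - t).view(-1, 1, 1, 1)   # toy linear schedule, for illustration only
z_t = alpha.sqrt() * z0 + (1 - alpha).sqrt() * noise
denoiser = ConditionalDenoiser()
loss_diff = F.mse_loss(denoiser(z_t, img_feats, t), noise)
```

Because the denoiser regresses Gaussian noise on a dense latent map rather than matching per-instance predictions to ground truth, no Hungarian matching or detection-style loss is needed, which is the simplification the abstract highlights.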