Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to handle the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture which omits these complexities. Our training process consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. The use of a generative model unlocks the exploration of mask completion or inpainting, which has applications in interactive segmentation. The experimental validation yields promising results for both panoptic segmentation and mask inpainting. While not setting a new state-of-the-art, our model's simplicity, generality, and mask completion capability are desirable properties.
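The permutation-invariance mentioned above can be illustrated with a small sketch: two instance maps that differ only in how the instance IDs are numbered describe the same segmentation, so a naive per-pixel comparison of raw IDs is not meaningful. The example below is an illustration only and is not part of the proposed method. The helper names `canonicalize` and `same_segmentation`, the toy masks, and the convention that ID 0 means "no instance" are all assumptions made for this sketch.

```python
from typing import Dict, List

# 2D grid of instance IDs; 0 is assumed (for this sketch) to mean "no instance".
Mask = List[List[int]]


def canonicalize(mask: Mask) -> Mask:
    """Relabel instance IDs in order of first appearance (row-major scan)."""
    mapping: Dict[int, int] = {0: 0}  # assumption: background ID stays fixed
    out: Mask = []
    for row in mask:
        new_row = []
        for label in row:
            if label not in mapping:
                mapping[label] = len(mapping)  # assign the next unused ID
            new_row.append(mapping[label])
        out.append(new_row)
    return out


def same_segmentation(a: Mask, b: Mask) -> bool:
    """True when a and b differ only by a permutation of instance IDs."""
    return canonicalize(a) == canonicalize(b)


# mask_b is mask_a with the instance IDs permuted (1->3, 2->1, 3->2).
mask_a = [[1, 1, 0],
          [2, 2, 0],
          [2, 3, 3]]
mask_b = [[3, 3, 0],
          [1, 1, 0],
          [1, 2, 2]]
# mask_c merges two instances of mask_a, so it is a different segmentation.
mask_c = [[1, 1, 0],
          [1, 1, 0],
          [1, 3, 3]]

print(mask_a == mask_b)                   # raw IDs disagree
print(same_segmentation(mask_a, mask_b))  # same grouping of pixels
print(same_segmentation(mask_a, mask_c))  # different grouping of pixels
```

Comparing canonical forms is only one way to make the equivalence explicit; it is shown here to clarify the problem that detection modules, matching-based losses, and post-processing typically address in other approaches.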