This paper introduces DFormer, an approach for universal image segmentation. DFormer treats the universal image segmentation task as a denoising process using a diffusion model. During training, DFormer adds various levels of Gaussian noise to the ground-truth masks and then learns a model that predicts the denoised masks from the corrupted ones. Specifically, we take deep pixel-level features along with the noisy masks as inputs to generate mask features and attention masks, and employ a diffusion-based decoder to perform mask prediction gradually. At inference, DFormer directly predicts the masks and their corresponding categories from a set of randomly generated masks. Extensive experiments demonstrate the merits of the proposed contributions on three image segmentation tasks: panoptic segmentation, instance segmentation, and semantic segmentation. DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D by 3.6% on the MS COCO val2017 set. Further, DFormer achieves promising semantic segmentation performance, outperforming the recent diffusion-based method by 2.2% on the ADE20K val set. Our source code and models will be made publicly available at https://github.com/cp3wan/DFormer
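As an illustration of the training-time corruption step described above, the following is a minimal sketch of adding Gaussian noise of varying strength to binary ground-truth masks. It is not taken from the DFormer code base: the linear noise schedule, the number of steps, the rescaling of masks to [-1, 1], and the names `make_alpha_bar` and `add_noise` are all assumptions borrowed from standard denoising diffusion practice.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(gt_masks, t, alpha_bar, rng):
    """Corrupt binary masks at noise level t.

    Uses the standard forward-diffusion form
        x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * eps.
    """
    # Assumption: map {0, 1} masks to {-1, 1} so the signal is zero-centred.
    x0 = gt_masks.astype(np.float64) * 2.0 - 1.0
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps


rng = np.random.default_rng(0)

# Two toy 8x8 ground-truth masks: top half and bottom half of the image.
gt_masks = np.zeros((2, 8, 8), dtype=np.uint8)
gt_masks[0, :4, :] = 1
gt_masks[1, 4:, :] = 1

alpha_bar = make_alpha_bar()
slightly_noisy = add_noise(gt_masks, 10, alpha_bar, rng)   # mostly signal
heavily_noisy = add_noise(gt_masks, 999, alpha_bar, rng)   # almost pure noise
```

At small `t` the corrupted masks stay close to the ground truth, while at the largest `t` they are nearly indistinguishable from Gaussian noise, which is consistent with starting inference from randomly generated masks.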