We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have shown a remarkable ability to generate high-quality images from diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models such as CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We propose to leverage the frozen representations of both of these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, absolute improvements of 8.3 PQ and 7.9 mIoU over the previous state of the art. The project page is available at \url{https://jerryxu.net/ODISE}.