We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images from diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We leverage the frozen internal representations of both of these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, an absolute improvement of 8.3 PQ and 7.9 mIoU over the previous state of the art. We open-source our code and models at https://github.com/NVlabs/ODISE.
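To illustrate the open-vocabulary classification step that text-image discriminative models such as CLIP make possible, the following is a minimal sketch, not the ODISE implementation. It assumes that each predicted mask has already been pooled into an embedding vector and that each candidate category name has been encoded into a text embedding in the same space. The function name `classify_masks`, the `temperature` value, and the three-dimensional toy embeddings are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_masks(mask_embeddings, text_embeddings, labels, temperature=0.01):
    """Assign each mask embedding the label with the most similar text embedding.

    Hypothetical helper: returns one (label, probability) pair per mask.
    """
    text_unit = [l2_normalize(t) for t in text_embeddings]
    results = []
    for emb in mask_embeddings:
        unit = l2_normalize(emb)
        # Cosine similarity between this mask and every category name.
        sims = [sum(a * b for a, b in zip(unit, t)) for t in text_unit]
        # Temperature-scaled softmax, shifted by the max logit for stability.
        logits = [s / temperature for s in sims]
        max_logit = max(logits)
        exps = [math.exp(l - max_logit) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(labels)), key=lambda i: probs[i])
        results.append((labels[best], probs[best]))
    return results


# Toy inputs (assumed): the label set can be swapped at test time without
# retraining, which is what makes the classification open-vocabulary.
labels = ["cat", "sofa", "sky"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

predictions = classify_masks(mask_embeddings, text_embeddings, labels)
for label, prob in predictions:
    print(label, round(prob, 3))
```

Because the category list is only an input to the similarity computation, new categories can be evaluated by encoding their names, with no change to the frozen models.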