Visual-textual correlations in the attention maps derived from text-to-image diffusion models have proven beneficial to dense visual prediction tasks such as semantic segmentation. However, a significant challenge arises from the input distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically used in semantic segmentation; this discrepancy hinders diffusion models from capturing accurate visual-textual correlations. To address this, we propose InvSeg, a test-time prompt inversion method for open-vocabulary semantic segmentation. InvSeg inverts image-specific visual context into the text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich the text prompts so that each class is associated with a structure-consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align the derived masks with the image's structure information: it softly selects anchors for each class and computes weighted distances that pull intra-class pixels together while pushing inter-class pixels apart, ensuring both mask distinction and internal consistency. By incorporating sample-specific context, InvSeg learns context-rich text prompts in the embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state-of-the-art performance on the PASCAL VOC, PASCAL Context, and COCO Object datasets.
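To make the CSC idea concrete, the following is a minimal PyTorch sketch of a contrastive soft-clustering loss, not the paper's exact implementation. It assumes per-pixel features from the diffusion reconstruction and per-class cross-attention masks; the function name, tensor shapes, and the softmax-based pull/push formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_soft_clustering(feats, masks, tau=0.1):
    """Sketch of a CSC-style loss (shapes and formulation are assumptions).

    feats: (D, H, W) pixel features, e.g., from the diffusion model's
           reconstruction process.
    masks: (C, H, W) per-class attention maps in [0, 1].
    """
    C, H, W = masks.shape
    D = feats.shape[0]
    f = F.normalize(feats.reshape(D, -1), dim=0)   # (D, HW) unit pixel features
    m = masks.reshape(C, -1)                       # (C, HW) soft assignments

    # Softly select an anchor per class: mask-weighted mean of pixel features.
    w = m / (m.sum(dim=1, keepdim=True) + 1e-6)    # (C, HW) per-class weights
    anchors = F.normalize(w @ f.t(), dim=1)        # (C, D) class anchors

    # Weighted pixel-anchor similarities; the softmax over classes pulls each
    # pixel toward its own class anchor and pushes it from the other anchors.
    sim = anchors @ f                              # (C, HW)
    log_prob = F.log_softmax(sim / tau, dim=0)     # normalize over classes
    loss = -(m * log_prob).sum() / (m.sum() + 1e-6)
    return loss

# Illustrative usage: 16 classes, 64x64 attention resolution, 256-dim features.
feats = torch.randn(256, 64, 64)
masks = torch.rand(16, 64, 64, requires_grad=True)
loss = contrastive_soft_clustering(feats, masks)
loss.backward()  # in practice, gradients would flow to the prompt embeddings
```

At test time the loss would be minimized with respect to the inverted prompt embeddings (which produce the attention masks), so that intra-class pixels cluster around their anchor while inter-class pixels separate.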