Zero-shot segmentation has recently shown notable improvement by leveraging the rich visual priors in large-scale text-to-image diffusion models, such as Stable Diffusion. However, current diffusion-based methods often face limitations due to the trade-off between spatial resolution and contextual information, as well as their reliance on a single static timestep for feature extraction. To overcome these challenges, our work introduces two key advancements. First, our Contextual Similarity Maps fuse high-resolution attention maps with rich U-Net encoder features, providing both fine-grained and robust per-pixel representations. Second, we identify an emergent hierarchical semantic progression within the denoising process of various diffusion models: representations transition from part-level abstractions at earlier timesteps to object-level abstractions at later stages. Leveraging this insight, we introduce a mechanism to adaptively select the optimal timestep for each pixel. Extensive experiments demonstrate that our method consistently outperforms existing zero-shot segmentation baselines, validating the efficacy of combining contextual features with dynamic, hierarchical timestep selection.
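As a rough illustration of per-pixel timestep selection, the sketch below chooses, for each pixel, the timestep at which its similarity scores are most decisive and reads the label from that timestep. The abstract does not state the selection rule, so the lowest-entropy criterion, the function name `select_timestep_per_pixel`, and the `similarity[t][p][k]` layout are all assumptions made for this example and should not be read as the method itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_timestep_per_pixel(similarity):
    """Pick one timestep and one label per pixel (hypothetical helper).

    similarity[t][p][k] is the score of pixel p for candidate segment k
    at timestep index t. ASSUMPTION: the "best" timestep for a pixel is
    the one whose softmax distribution has the lowest entropy. This
    criterion is illustrative only and is not taken from the paper.
    """
    num_timesteps = len(similarity)
    num_pixels = len(similarity[0])
    timesteps, labels = [], []
    for p in range(num_pixels):
        # Timestep at which this pixel's assignment is most confident.
        best_t = min(
            range(num_timesteps),
            key=lambda t: entropy(softmax(similarity[t][p])),
        )
        scores = similarity[best_t][p]
        timesteps.append(best_t)
        # Label is the highest-scoring segment at the chosen timestep.
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return timesteps, labels


# Toy input: 2 timesteps, 2 pixels, 2 candidate segments.
toy_similarity = [
    [[5.0, 0.0], [0.1, 0.0]],  # timestep 0: pixel 0 confident, pixel 1 not
    [[0.2, 0.3], [0.0, 4.0]],  # timestep 1: pixel 0 not, pixel 1 confident
]
chosen_timesteps, chosen_labels = select_timestep_per_pixel(toy_similarity)
# chosen_timesteps == [0, 1], chosen_labels == [0, 1]
```

In this toy input each pixel is assigned from a different timestep, which is the behaviour a single static timestep cannot provide. In practice the scores would come from similarity maps computed on diffusion features, and a different confidence measure could replace entropy.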