Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be exploited for text-conditioned semantic and open-vocabulary segmentation, and that the approach generalizes to a range of downstream tasks, yielding a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model as a unified segmentation framework. Our approach encodes the input image and the ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features at multiple scales, enabling the model to align textual queries with evolving visual representations. This design turns an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios, all without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
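To make the latent concatenation step concrete, the following is a minimal sketch of how an image latent and a mask latent might be stacked along the channel axis before entering the U-Net. It is an illustration under stated assumptions, not the DiGSeg implementation: `toy_encode` is a hypothetical stand-in for a pretrained VAE encoder, the values of `LATENT_CHANNELS` and `DOWNSAMPLE` are assumed, and replicating the mask to three channels before encoding is our guess at one plausible way to reuse an image encoder for masks.

```python
import numpy as np

LATENT_CHANNELS = 4  # assumed latent width, not taken from the paper
DOWNSAMPLE = 8       # assumed spatial reduction factor of the encoder


def toy_encode(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical stand-in for a pretrained VAE encoder.

    Average-pools a (C, H, W) array by DOWNSAMPLE in each spatial
    dimension, then mixes channels with a random linear map so the
    result has LATENT_CHANNELS channels.
    """
    c, h, w = x.shape
    assert h % DOWNSAMPLE == 0 and w % DOWNSAMPLE == 0
    pooled = x.reshape(
        c, h // DOWNSAMPLE, DOWNSAMPLE, w // DOWNSAMPLE, DOWNSAMPLE
    ).mean(axis=(2, 4))
    proj = rng.standard_normal((LATENT_CHANNELS, c))
    return np.einsum("oc,chw->ohw", proj, pooled)


def build_unet_input(
    image: np.ndarray, mask: np.ndarray, rng: np.random.Generator
) -> np.ndarray:
    """Encode image and mask, then concatenate them channel-wise."""
    image_latent = toy_encode(image, rng)
    # Assumption: the (H, W) mask is replicated to 3 channels so the
    # same kind of encoder can process it like an image.
    mask_rgb = np.repeat(mask[None].astype(np.float32), 3, axis=0)
    mask_latent = toy_encode(mask_rgb, rng)
    return np.concatenate([image_latent, mask_latent], axis=0)


rng = np.random.default_rng(0)
image = rng.random((3, 64, 64), dtype=np.float32)  # dummy RGB image
mask = rng.random((64, 64)) > 0.5                  # dummy binary mask
unet_input = build_unet_input(image, mask, rng)
print(unet_input.shape)  # (8, 8, 8)
```

The resulting tensor has twice the latent channel count at one eighth of the input resolution. The text pathway is omitted here; in a real system the language features would be supplied to the U-Net separately, for example through cross-attention.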