As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are known as training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at https://github.com/Darkbblue/goca.
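To make the pipeline concrete, the sketch below shows the generic shape of a training-free diffusion segmentor operating on cross-attention maps. The abstract does not specify the exact form of auto aggregation or per-pixel rescaling, so a plain mean over heads and layers and a per-token max-normalization are used as hedged placeholders; shapes, function names, and the rescaling rule are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def aggregate_attention(attn_maps):
    """Fuse per-layer, per-head cross-attention maps into one global map.

    attn_maps: array of shape (layers, heads, pixels, tokens).
    The paper's "auto aggregation" is not detailed in the abstract;
    a uniform mean over layers and heads stands in as a placeholder.
    """
    return attn_maps.mean(axis=(0, 1))  # -> (pixels, tokens)

def per_pixel_rescale(global_map):
    """Rebalance token scores before the per-pixel argmax.

    Each token channel is divided by its own maximum so that a token
    with uniformly high attention cannot dominate every pixel. This is
    an illustrative choice, not necessarily the paper's exact rescaling.
    """
    token_max = global_map.max(axis=0, keepdims=True)  # (1, tokens)
    return global_map / np.clip(token_max, 1e-8, None)

# Toy example: 2 layers, 4 heads, 6 pixels, 3 class tokens.
rng = np.random.default_rng(0)
attn = rng.random((2, 4, 6, 3))
seg = per_pixel_rescale(aggregate_attention(attn)).argmax(axis=1)
print(seg.shape)  # one class index per pixel
```

The two steps mirror the two gaps named in the abstract: aggregation produces the unified global map, and rescaling corrects token-score imbalance before assigning each pixel its highest-scoring class.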