Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features, namely attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, in which appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach achieves strong performance and surpasses prior methods on most datasets, establishing a new state of the art without fine-tuning, additional components, or complex reasoning.
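The stop-word filtering and sink-suppression steps described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the attention-map shapes, the `STOP_WORDS` list, and the `sink_threshold` quantile are assumptions introduced for the example.

```python
import numpy as np

# Illustrative stop-word list; the actual set used by the method may differ.
STOP_WORDS = {"the", "a", "an", "of", "is", "on", "in", "and", "to", "with"}

def grounding_heatmap(attn, tokens, sink_threshold=0.95):
    """Aggregate per-token cross-attention maps into a grounding heatmap.

    attn:   (T, H, W) array, one spatial attention map per text token
            (hypothetical layout for this sketch).
    tokens: list of T text tokens aligned with the attention maps.
    """
    # Drop stop-word maps: stop words act as attention magnets that
    # accumulate surplus attention and add noise.
    keep = [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]
    maps = attn[keep]

    # Suppress global attention sinks: spatial positions receiving
    # near-maximal attention averaged across the kept tokens.
    mean_map = maps.mean(axis=0)
    sink_mask = mean_map > np.quantile(mean_map, sink_threshold)
    maps = np.where(sink_mask[None], 0.0, maps)

    # Average the remaining maps and normalize to [0, 1].
    heat = maps.mean(axis=0)
    spread = heat.max() - heat.min()
    return (heat - heat.min()) / spread if spread > 0 else heat
```

A call such as `grounding_heatmap(attn, ["the", "red", "ball"])` would keep only the maps for "red" and "ball" before sink suppression and averaging.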