Attention sinks, tokens that receive disproportionate attention mass, are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis of attention sinks in text-to-image diffusion: we dynamically identify the dominant attention recipients at each timestep and suppress them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion~3 (with SDXL corroboration), removing these sinks at $k{=}1$ does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2); only under stronger interventions ($k\!\geq\!10$) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless \emph{sink-specific} ($\sim\!6\times$ larger than under equal-budget random masking), revealing an empirical dissociation between trajectory-level perturbation and \emph{semantic alignment} in diffusion transformers.\footnote{Code available at https://github.com/wfz666/ICML26-attention-sink.}