Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize \emph{semantic diffusion}: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and \texttt{[CLS]} features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax-1.5 while preserving global token connectivity. On DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.
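To make the proposed intervention concrete, the following is a minimal NumPy sketch of swapping softmax for entmax-1.5 inside single-head scaled dot-product attention. It is not the authors' implementation: the function names `entmax15`, `softmax`, and `attention`, the single-head setup, and the toy shapes are all assumptions made for illustration. The entmax-1.5 mapping is computed with the standard sort-based closed-form threshold. Every token can still attend to every other token, so global connectivity is preserved, but low-scoring tokens can receive exactly zero weight.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Dense baseline: every entry receives strictly positive weight."""
    z = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def entmax15(scores, axis=-1):
    """Entmax with alpha = 1.5, exact sort-based solution.

    Returns p with p_i = max(z_i / 2 - tau, 0) ** 2, where tau is chosen so
    that p sums to 1 along `axis`. Entries far below the maximum get weight 0.
    """
    z = np.moveaxis(np.asarray(scores, dtype=np.float64), axis, -1)
    # Shift for numerical stability, then halve (the (alpha - 1) factor).
    z = (z - z.max(axis=-1, keepdims=True)) / 2.0

    z_sorted = -np.sort(-z, axis=-1)               # descending order
    k = np.arange(1, z.shape[-1] + 1)              # candidate support sizes
    mean = np.cumsum(z_sorted, axis=-1) / k
    mean_sq = np.cumsum(z_sorted ** 2, axis=-1) / k
    ss = k * (mean_sq - mean ** 2)                 # sum of squared deviations
    delta = np.clip((1.0 - ss) / k, 0.0, None)
    tau = mean - np.sqrt(delta)                    # threshold per support size

    # The support size is the number of sorted entries at or above their threshold.
    support = np.sum(tau <= z_sorted, axis=-1, keepdims=True)
    tau_star = np.take_along_axis(tau, support - 1, axis=-1)

    p = np.clip(z - tau_star, 0.0, None) ** 2
    return np.moveaxis(p, -1, axis)


def attention(q, k, v, normalizer=softmax):
    """Single-head attention; `normalizer` maps scores to attention weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = normalizer(scores, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for patch tokens: 16 tokens, 64-dimensional head.
    q, k, v = (rng.normal(size=(16, 64)) for _ in range(3))
    _, w_soft = attention(q, k, v, softmax)
    _, w_ent = attention(q, k, v, entmax15)
    print("zero weights (softmax):   ", int((w_soft == 0).sum()))
    print("zero weights (entmax-1.5):", int((w_ent == 0).sum()))
```

In this sketch the only change between the two attention variants is the normalizer passed to `attention`, which mirrors the abstract's framing of sparse attention as a minimal intervention. A training setup would additionally need a differentiable implementation of entmax-1.5 in the chosen deep learning framework.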