We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): the feature maps of these models exhibit grid-like artifacts, which hurt ViT performance in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue to the positional embeddings at the input stage. To mitigate it, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, using the derived clean-feature estimates as supervision. Our method, DVT, does not require re-training existing pre-trained ViTs and is immediately applicable to any Vision Transformer architecture. We evaluate DVT on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that it consistently improves existing state-of-the-art general-purpose models on semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.