Denoising Vision Transformers

We study a subtle but significant problem inherent to Vision Transformers (ViTs): their feature maps exhibit grid-like artifacts that hurt performance in downstream tasks. Our investigation traces this issue to the positional embeddings at the input stage. To address it, we propose a novel noise model that is applicable to all ViTs. Specifically, the noise model decomposes ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. This decomposition is achieved by enforcing cross-view feature consistency with neural fields on a per-image basis. The per-image optimization extracts artifact-free features from raw ViT outputs, providing clean features for offline applications. To extend our solution to online use, we introduce a learnable denoiser that predicts artifact-free features directly from unprocessed ViT outputs and generalizes well to novel data without per-image optimization. Our two-stage approach, termed Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
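
To make the cross-view consistency idea concrete, the following is a minimal toy sketch, not the DVT implementation. It assumes a 1-D "scene" instead of an image, keeps only one artifact term that depends on grid position alone (the abstract's model has two artifact-related terms), and swaps the neural fields for a plain linear least-squares solve. The function name `decompose_views` and all variable names are hypothetical. Each view is a crop of the scene at a different offset; the scene signal moves with the crop, while the artifact stays locked to the output grid, which is what lets the two be separated.

```python
import numpy as np

def decompose_views(views, offsets, scene_len):
    """Split crops into a shared scene signal and a grid-locked artifact.

    views:   array (K, n, C); view k covers scene positions
             offsets[k] .. offsets[k] + n - 1
    returns: (scene, artifact) with shapes (scene_len, C) and (n, C)

    Toy model (an assumption of this sketch):
        views[k, i] = scene[offsets[k] + i] + artifact[i]
    """
    K, n, C = views.shape
    rows = np.arange(K * n)
    k_idx, i_idx = np.divmod(rows, n)  # view index and grid index of each row

    # One equation per observed feature, plus one extra row that pins the
    # artifact to zero mean. Without it, a constant could be moved freely
    # between the scene and the artifact.
    A = np.zeros((K * n + 1, scene_len + n))
    A[rows, np.asarray(offsets)[k_idx] + i_idx] = 1.0  # scene term
    A[rows, scene_len + i_idx] = 1.0                   # artifact term
    A[-1, scene_len:] = 1.0                            # sum(artifact) = 0

    b = np.vstack([views.reshape(K * n, C), np.zeros((1, C))])
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:scene_len], sol[scene_len:]

# Synthetic data: a random scene plus a fixed positional artifact.
rng = np.random.default_rng(0)
scene_len, n, C = 32, 8, 4
true_scene = rng.normal(size=(scene_len, C))
true_artifact = rng.normal(size=(n, C))
offsets = np.arange(scene_len - n + 1)  # every possible crop position
views = np.stack([true_scene[d:d + n] + true_artifact for d in offsets])

scene, artifact = decompose_views(views, offsets, scene_len)
denoised = views - artifact  # artifact-free up to a per-channel constant
```

In this toy setting the artifact is recovered exactly up to a per-channel constant, because the data are noise-free and the crops overlap. Real ViT features satisfy neither condition, which is one reason a more expressive model than this linear one is needed in practice.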