Sub-token ViT Embedding via Stochastic Resonance Transformers

We discover quantization artifacts in Vision Transformers (ViTs) that arise from the image tokenization step inherent in these architectures. These artifacts yield coarsely quantized features, which degrade performance, especially on downstream dense prediction tasks. We present a zero-shot method to improve how pre-trained ViTs handle spatial quantization. In particular, we propose to ensemble the features obtained by perturbing input images with sub-token spatial translations, inspired by Stochastic Resonance, a method traditionally applied to climate dynamics and signal processing. We term our method "Stochastic Resonance Transformer" (SRT), which we show can effectively super-resolve features of pre-trained ViTs, capturing more of the local fine-grained structures that might otherwise be neglected as a result of tokenization. SRT can be applied at any layer, on any task, and does not require any fine-tuning. The benefit of applying it at intermediate layers is evident in monocular depth prediction, where we show that ensembling model outputs is detrimental, whereas applying SRT to intermediate ViT features outperforms the baseline models by an average of 4.7% and 14.9% on the RMSE and RMSE-log metrics, respectively, across three different architectures. When applied to semi-supervised video object segmentation, SRT also improves over the baseline models uniformly across all metrics, and by an average of 2.4% in F&J score. We further show that these quantization artifacts can be attenuated to some extent via self-distillation. On unsupervised salient region segmentation, SRT improves upon the base model by an average of 2.1% on the maxF metric. Finally, despite operating purely on pixel-level features, SRT generalizes to non-dense prediction tasks such as image retrieval and object discovery, yielding consistent improvements of up to 2.6% and 1.0%, respectively.
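The ensembling idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the implementation evaluated here: `patch_mean_encoder` is a hypothetical stand-in for a frozen ViT (it simply averages each token's pixels), features are aggregated with a plain mean, image borders are handled by replicating edge pixels, and the helper names `translate`, `upsample`, and `srt_features` are invented for this example.

```python
from itertools import product


def patch_mean_encoder(image, patch):
    """Hypothetical stand-in for a frozen ViT: one scalar feature per token.

    Assumes the image height and width are multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[y][x]
                for y in range(ty, ty + patch)
                for x in range(tx, tx + patch)) / patch ** 2
            for tx in range(0, w, patch)
        ]
        for ty in range(0, h, patch)
    ]


def translate(grid, dy, dx):
    """Shift grid content by (dy, dx) pixels, replicating border values."""
    h, w = len(grid), len(grid[0])
    return [
        [grid[min(max(y - dy, 0), h - 1)][min(max(x - dx, 0), w - 1)]
         for x in range(w)]
        for y in range(h)
    ]


def upsample(tokens, patch):
    """Nearest-neighbour upsampling of a token grid back to pixel resolution."""
    h, w = len(tokens) * patch, len(tokens[0]) * patch
    return [[tokens[y // patch][x // patch] for x in range(w)] for y in range(h)]


def srt_features(image, patch, max_shift=None, encoder=patch_mean_encoder):
    """Average encoder features over sub-token translations of the input."""
    if max_shift is None:
        max_shift = patch // 2  # assumption: keep every shift below one token
    h, w = len(image), len(image[0])
    shifts = list(product(range(-max_shift, max_shift + 1), repeat=2))
    total = [[0.0] * w for _ in range(h)]
    for dy, dx in shifts:
        feats = upsample(encoder(translate(image, dy, dx), patch), patch)
        aligned = translate(feats, -dy, -dx)  # undo the shift in feature space
        for y in range(h):
            for x in range(w):
                total[y][x] += aligned[y][x]
    return [[v / len(shifts) for v in row] for row in total]


# Toy input: an 8x8 image with a vertical step edge that is not token-aligned.
image = [[1.0 if x >= 6 else 0.0 for x in range(8)] for _ in range(8)]
baseline = upsample(patch_mean_encoder(image, 4), 4)
refined = srt_features(image, 4)
print("distinct baseline values per row:", len(set(baseline[0])))
print("distinct SRT values per row:", len(set(round(v, 6) for v in refined[0])))
```

In this toy setting the baseline feature map can only change value at token boundaries, whereas the averaged map varies at pixel granularity near the edge. With a real pre-trained ViT, the `encoder` argument would be replaced by a function returning the token features of the chosen layer.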
