Vision Transformers (ViTs) have been empirically observed to be quite insensitive to the order of input tokens, so there is a clear need for a self-supervised pretext task that enhances their location awareness. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings, and the model then classifies the actual position of each non-overlapping patch among all possible positions, based solely on its visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, because different patches may have similar visual appearances, in which case reconstructing their exact positions is unnecessary, we propose position smoothing and attentive reconstruction strategies to relax this classification problem. In empirical evaluations, DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
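The formulation described in the abstract can be illustrated with a minimal sketch. It is not the released implementation. The function names, the keep and drop ratios, and the Gaussian form of the smoothed target are all assumptions made for illustration; the abstract does not specify how position smoothing is computed, and attentive reconstruction is omitted here.

```python
import math
import random


def sample_masks(num_patches, patch_keep_ratio, pos_drop_ratio, rng):
    """Choose visible patches, then choose which of them lose their positional embedding.

    Both ratios are hypothetical hyperparameters, not values taken from the paper.
    """
    num_visible = max(1, round(num_patches * patch_keep_ratio))
    visible = sorted(rng.sample(range(num_patches), num_visible))
    num_dropped = max(1, round(num_visible * pos_drop_ratio))
    dropped = sorted(rng.sample(visible, num_dropped))
    return visible, dropped


def smoothed_target(true_index, grid_size, sigma):
    """Soft label over all grid positions (assumed Gaussian in 2D grid distance).

    Positions near the true one receive some probability mass, so a
    prediction that lands on a neighbouring patch is penalised less.
    """
    true_row, true_col = divmod(true_index, grid_size)
    weights = []
    for index in range(grid_size * grid_size):
        row, col = divmod(index, grid_size)
        squared_dist = (row - true_row) ** 2 + (col - true_col) ** 2
        weights.append(math.exp(-squared_dist / (2 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return -sum(t * (l - log_norm) for t, l in zip(target, logits))


# Toy usage on a 4x4 patch grid (16 candidate positions).
rng = random.Random(0)
grid_size = 4
visible, dropped = sample_masks(grid_size * grid_size, 0.25, 0.75, rng)

# Soft position label for the first patch whose positional embedding was dropped.
target = smoothed_target(dropped[0], grid_size, sigma=1.0)

# Placeholder for the model's position logits; a real model would predict these
# from the patch's visual appearance.
logits = [0.0] * (grid_size * grid_size)
loss = soft_cross_entropy(logits, target)
```

In this sketch the loss is computed only for patches whose positional embeddings were dropped. The smoothed target replaces a one-hot label, so predicting a position adjacent to the true one costs less than predicting a distant one.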