Vision Transformers (ViTs) are empirically observed to be quite insensitive to the order of input tokens, which makes evident the need for a self-supervised pretext task that enhances their location awareness. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings, and the model then classifies the actual position of each non-overlapping patch among all possible positions, based solely on its visual appearance. To avoid trivial solutions, we increase the difficulty of the task by keeping only a subset of patches visible. Additionally, because different patches may have similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since reconstructing exact positions is unnecessary in such cases. In empirical evaluations, DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
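The formulation described above can be illustrated with a minimal sketch in plain Python. It is not the released implementation. The function names, the keep and drop ratios, and the 4x4 grid are hypothetical choices made for illustration. Position smoothing is approximated here by a Gaussian kernel over grid distance, which is an assumption about one plausible way to build soft labels. Attentive reconstruction is omitted, and random logits stand in for the ViT's per-patch position predictions.

```python
import math
import random


def sample_droppos_masks(num_patches, patch_keep_ratio, pos_drop_ratio, rng):
    """Pick visible patches, then pick which of them lose their position embedding."""
    num_visible = max(1, int(num_patches * patch_keep_ratio))
    visible = sorted(rng.sample(range(num_patches), num_visible))
    num_dropped = max(1, int(num_visible * pos_drop_ratio))
    dropped = set(rng.sample(visible, num_dropped))
    # pos_dropped[k] is True when visible[k] is fed to the model without a position.
    pos_dropped = [idx in dropped for idx in visible]
    return visible, pos_dropped


def smoothed_targets(grid_size, sigma):
    """Soft labels (assumed Gaussian kernel, sigma > 0): row i peaks at position i."""
    num_positions = grid_size * grid_size
    coords = [(p // grid_size, p % grid_size) for p in range(num_positions)]
    targets = []
    for r1, c1 in coords:
        weights = [
            math.exp(-((r1 - r2) ** 2 + (c1 - c2) ** 2) / sigma ** 2)
            for r2, c2 in coords
        ]
        total = sum(weights)
        targets.append([w / total for w in weights])
    return targets


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def droppos_loss(logits, visible, pos_dropped, targets):
    """Mean soft cross-entropy over visible patches whose position was dropped."""
    losses = []
    for row, patch_idx, was_dropped in zip(logits, visible, pos_dropped):
        if not was_dropped:
            continue  # patches that kept their position are not classified
        log_probs = log_softmax(row)
        soft_label = targets[patch_idx]
        losses.append(-sum(t * lp for t, lp in zip(soft_label, log_probs)))
    return sum(losses) / len(losses)


# Hypothetical usage: a 4x4 patch grid, 25% of patches visible, 75% of those
# visible patches with their positional embedding dropped.
rng = random.Random(0)
grid_size = 4
num_patches = grid_size * grid_size
visible, pos_dropped = sample_droppos_masks(num_patches, 0.25, 0.75, rng)
targets = smoothed_targets(grid_size, sigma=1.0)
# Random logits stand in for the model's output (one row per visible patch).
logits = [[rng.gauss(0.0, 1.0) for _ in range(num_patches)] for _ in visible]
loss = droppos_loss(logits, visible, pos_dropped, targets)
print(f"visible={visible} pos_dropped={pos_dropped} loss={loss:.4f}")
```

In this sketch the loss is computed only over visible patches whose positional embedding was dropped, mirroring the description that the model classifies positions from visual appearance alone. As `sigma` shrinks toward zero the soft labels approach one-hot targets, which recovers the unrelaxed classification problem.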