We present a simple self-supervised method to enhance the performance of ViT features for dense downstream tasks. Our Lightweight Feature Transform (LiFT) is a straightforward and compact postprocessing network that can be applied to enhance the features of any pre-trained ViT backbone. LiFT is fast and easy to train with a self-supervised objective, and it boosts the density of ViT features at minimal extra inference cost. Furthermore, we demonstrate that LiFT can be combined with approaches that use additional task-specific downstream modules by integrating it with ViTDet for COCO detection and segmentation. Despite the simplicity of LiFT, we find that it is not merely learning a more complex version of bilinear interpolation. Instead, our LiFT training protocol leads to several desirable emergent properties that benefit ViT features in dense downstream tasks, including greater scale invariance and better object boundary maps. By training LiFT for only a few epochs, we show improved performance on keypoint correspondence, detection, segmentation, and object discovery tasks. Overall, LiFT provides an easy way to unlock the benefits of denser feature arrays for a fraction of the computational cost. For more details, refer to our project page at https://www.cs.umd.edu/~sakshams/LiFT/.