This paper introduces FUNGI, Features from UNsupervised GradIents, a method to enhance the features of transformer encoders by leveraging self-supervised gradients. Our method is simple: given any pretrained model, we first compute gradients from various self-supervised objectives for each input. These gradients are projected to a lower dimension and then concatenated with the model's output embedding. The resulting features are evaluated on k-nearest neighbor classification over 11 datasets from vision, 5 from natural language processing, and 2 from audio. Across backbones spanning various sizes and pretraining strategies, FUNGI features provide consistent performance improvements over the embeddings alone. We also show that FUNGI features benefit linear classification, clustering, and image retrieval, and that they significantly improve the retrieval-based in-context scene understanding abilities of pretrained models, for example improving upon DINO by +17% for semantic segmentation, without any training.
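The pipeline described above can be sketched in a few lines. This is a minimal illustration using random arrays as stand-ins for real self-supervised gradients and encoder embeddings; the dimensions and the fixed random projection matrix are assumptions for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: output embeddings from a pretrained encoder, and
# per-input gradients from one self-supervised objective (stand-in data).
n_samples, embed_dim, grad_dim, proj_dim = 8, 16, 1024, 16

embeddings = rng.normal(size=(n_samples, embed_dim))
gradients = rng.normal(size=(n_samples, grad_dim))

# Project the high-dimensional gradients to a lower dimension with a fixed
# random matrix, then concatenate them with the model's output embedding.
projection = rng.normal(size=(grad_dim, proj_dim)) / np.sqrt(proj_dim)
projected = gradients @ projection

fungi_features = np.concatenate([embeddings, projected], axis=1)
print(fungi_features.shape)  # (8, 32)
```

The combined features can then be used directly in non-parametric evaluations such as k-nearest neighbor classification, as in the abstract.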