Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
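To make the multi-view consistency idea concrete, below is a minimal PyTorch sketch of a FeatUp-style training objective: high-resolution features, when transformed and re-downsampled, should reproduce the backbone's low-resolution features across random "views" of the image. The module names (`featurizer`, `upsampler`, `downsampler`) and the single flip transform are illustrative assumptions; the paper's actual recipe uses a richer family of jitters (pads, zooms, flips) and additional loss terms.

```python
import torch
import torch.nn.functional as F

def multiview_consistency_loss(image, featurizer, upsampler, downsampler, n_views=4):
    """Sketch of a FeatUp-style multi-view reconstruction loss.

    Assumed (hypothetical) interfaces:
      featurizer(img)          -> (B, C, h, w) low-res backbone features
      upsampler(feats, img)    -> (B, C, H, W) guided high-res features
      downsampler(feats)       -> (B, C, h, w) learned re-downsampling

    The published method also uses richer view transforms and loss
    weighting; this sketch keeps only the core consistency term.
    """
    lr_feats = featurizer(image)           # observed low-res features
    hr_feats = upsampler(lr_feats, image)  # candidate high-res features
    loss = 0.0
    for _ in range(n_views):
        # Random horizontal flip as a stand-in for the paper's view jitters.
        flip = torch.rand(()) < 0.5
        img_t = torch.flip(image, dims=[-1]) if flip else image
        hr_t = torch.flip(hr_feats, dims=[-1]) if flip else hr_feats
        target = featurizer(img_t)   # low-res features of the transformed view
        pred = downsampler(hr_t)     # "render" that view from high-res features
        loss = loss + F.mse_loss(pred, target)
    return loss / n_views
```

The analogy to NeRFs is that the high-resolution feature map plays the role of a latent scene, and each transformed low-resolution feature map is a noisy 2D observation it must explain.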