Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
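The multi-view consistency objective can be illustrated with a toy sketch. Here, "views" are small spatial jitters of the input, and the constraint is that downsampling the candidate high-resolution features for each view should reproduce the low-resolution features the model actually produces for that view. Everything below is hypothetical and for illustration only (the function names, the use of integer shifts as views, and average pooling as the model's downsampler are assumptions, not the authors' implementation, which learns the downsampler and uses richer jitters):

```python
import numpy as np

def avg_pool(x, f):
    # Average-pool a [C, H, W] feature map by an integer factor f
    # (stand-in for the learned downsampler in the real method).
    c, h, w = x.shape
    return x.reshape(c, h // f, f, w // f, f).mean(axis=(2, 4))

def multiview_consistency_loss(hr_feats, lr_view_fn, jitters, factor=4):
    # For each jitter (dy, dx), the jittered-then-downsampled candidate
    # high-res features should match the low-res features observed for
    # that view; the loss averages the squared error over views.
    losses = []
    for dy, dx in jitters:
        pred = avg_pool(np.roll(hr_feats, (dy, dx), axis=(1, 2)), factor)
        losses.append(np.mean((pred - lr_view_fn(dy, dx)) ** 2))
    return float(np.mean(losses))

# Toy check: features whose downsampled views match the observations
# achieve zero loss, while a degenerate candidate does not.
rng = np.random.default_rng(0)
true_hr = rng.standard_normal((8, 32, 32))  # [C, H, W] "ground truth"

def lr_view_fn(dy, dx):
    # Simulated low-res observation for view (dy, dx).
    return avg_pool(np.roll(true_hr, (dy, dx), axis=(1, 2)), 4)

jitters = [(0, 0), (1, 0), (0, 1), (2, 2)]
loss_true = multiview_consistency_loss(true_hr, lr_view_fn, jitters)
loss_bad = multiview_consistency_loss(np.zeros_like(true_hr), lr_view_fn, jitters)
```

In the paper's framing this plays the role NeRF's ray consistency plays for radiance fields: many low-resolution "views" jointly constrain a single high-resolution representation.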