We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.