Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process high-resolution images using a sliding-window strategy at the pre-training resolution. While finer strides improve accuracy, they come at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
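To make the cost argument concrete, the following is a minimal sketch, not the authors' implementation. It counts how many forward passes a sliding-window teacher needs at a given stride, and it shows a plain mean-squared-error form of a feature regression loss. The function names, the border-handling rule, the MSE form of the loss, and the 224-pixel window used in the note afterwards are all illustrative assumptions; the abstract does not specify them.

```python
from typing import List, Sequence


def window_origins(size: int, window: int, stride: int) -> List[int]:
    """Start offsets of sliding windows along one axis.

    Assumption: if the stride does not land exactly on the border,
    one extra window is placed flush with the border so that the
    whole extent is covered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if size <= window:
        return [0]
    origins = list(range(0, size - window + 1, stride))
    if origins[-1] != size - window:
        origins.append(size - window)
    return origins


def count_windows(height: int, width: int, window: int, stride: int) -> int:
    """Number of teacher forward passes for one image (one per window)."""
    rows = len(window_origins(height, window, stride))
    cols = len(window_origins(width, window, stride))
    return rows * cols


def mse_feature_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between two equally shaped sets of feature vectors.

    Assumption: plain MSE stands in for the feature regression loss;
    the exact loss used by SPAR is not given in the abstract.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same number of vectors")
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature vectors must have the same dimension")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no features to compare")
    return total / count
```

Under these assumptions, an 896×896 image with a 224-pixel window needs 16 teacher passes at stride 224 and 49 at stride 112, while the single-pass student always needs one. That gap is the cost the distillation is meant to remove.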