Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, which utilizes class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods that rely on complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves that higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in foundation models: a disconnect in which traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the last layer as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potential of DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
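The prototype-plus-refinement pipeline named in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: it assumes prototypes are built by masked average pooling of support features, that query pixels are scored by cosine similarity to foreground/background prototypes, and that "Gram-matrix refinement" propagates those scores through the query's own feature self-similarity. All function names here (`masked_prototype`, `segment`) are hypothetical.

```python
import numpy as np

def masked_prototype(feats, mask):
    """Average-pool features over a binary mask. feats: (HW, C), mask: (HW,)."""
    w = mask.astype(feats.dtype)
    return (feats * w[:, None]).sum(axis=0) / max(w.sum(), 1e-6)

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def segment(support_feats, support_mask, query_feats):
    """Training-free two-way segmentation sketch on frozen features.

    support_feats, query_feats: (HW, C) patch features from a frozen backbone.
    support_mask: (HW,) binary foreground mask for the support image.
    Returns a (HW,) array of 0 (background) / 1 (foreground) labels.
    """
    # Class-specific prototypes from the annotated support image.
    fg = masked_prototype(support_feats, support_mask)
    bg = masked_prototype(support_feats, 1.0 - support_mask)

    # Cosine similarity of each query patch to both prototypes.
    q = l2norm(query_feats)
    scores = np.stack([q @ l2norm(bg), q @ l2norm(fg)], axis=-1)  # (HW, 2)

    # Gram-matrix refinement (assumed form): smooth the scores by
    # propagating them through the query's row-normalized self-similarity.
    gram = np.maximum(q @ q.T, 0.0)
    gram = gram / (gram.sum(axis=1, keepdims=True) + 1e-8)
    refined = gram @ scores

    return refined.argmax(axis=-1)
```

In this sketch the refinement step acts like one round of similarity-weighted label smoothing, which is one common reading of Gram-matrix refinement; the paper's exact formulation may differ.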