With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but typically restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than encoded solely in the final layers. To exploit this distribution, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. The mechanism learns to identify the layers most relevant to a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks that differ from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate-layer information and demonstrate a principled, task-aware approach to unlocking its potential in probing-based adaptation.
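The layer-fusion idea described above can be sketched as a learnable query that attends over per-layer features. The following is a minimal, hypothetical NumPy illustration (not the authors' implementation): it assumes one frozen [CLS] feature per Transformer layer, a task-specific query vector, and a linear head, all of which would normally be trained jointly while the backbone stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, dim, num_classes = 12, 64, 10

# Frozen per-layer [CLS] features from a ViT backbone
# (random stand-ins here; in practice, extracted once per image).
layer_feats = rng.standard_normal((num_layers, dim))

# Learnable probe parameters (hypothetical names; trained by SGD in practice).
query = rng.standard_normal(dim)                   # task-specific attention query
head_w = rng.standard_normal((dim, num_classes))   # linear classification head

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Attention scores: how relevant each layer's representation is to the task.
scores = layer_feats @ query / np.sqrt(dim)
weights = softmax(scores)          # one weight per layer, sums to 1

# Fuse all layers into a single representation, then classify.
fused = weights @ layer_feats      # (dim,)
logits = fused @ head_w            # (num_classes,)
```

Inspecting `weights` after training gives the kind of per-layer relevance map the abstract refers to: tasks far from the pre-training domain would place more mass on intermediate layers.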