Large models such as Vision Transformers (ViTs) have demonstrated remarkable superiority over smaller architectures like ResNet in few-shot classification, owing to their powerful representational capacity. However, fine-tuning such large models demands extensive GPU memory and prolonged training time, making them impractical for many real-world low-resource scenarios. To bridge this gap, we propose EfficientFSL, a query-only fine-tuning framework tailored specifically for few-shot classification with ViT, which achieves competitive performance while significantly reducing computational overhead. EfficientFSL fully leverages the knowledge embedded in the pre-trained model and its strong comprehension ability, achieving high classification accuracy with an extremely small number of tunable parameters. Specifically, we introduce a lightweight trainable Forward Block to synthesize task-specific queries that extract informative features from the intermediate representations of the pre-trained model in a query-only manner. We further propose a Combine Block to fuse multi-layer outputs, enhancing the depth and robustness of feature representations. Finally, a Support-Query Attention Block mitigates distribution shift by adjusting prototypes to align with the query set distribution. With minimal trainable parameters, EfficientFSL achieves state-of-the-art performance on four in-domain few-shot datasets and six cross-domain datasets, demonstrating its effectiveness in real-world applications.
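The query-only extraction described above can be sketched as follows. A small set of trainable queries attends over frozen intermediate tokens of the backbone, so only the queries carry gradients. The shapes, the single-head dot-product attention form, and all names here are illustrative assumptions, not the paper's exact Forward Block.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_only_attention(queries, frozen_tokens):
    """Trainable queries (m, d) attend over frozen backbone tokens (n, d).

    Only `queries` would receive gradients during fine-tuning; the
    backbone tokens are treated as fixed features.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ frozen_tokens.T / np.sqrt(d))  # (m, n)
    return attn @ frozen_tokens  # (m, d) task-specific features

rng = np.random.default_rng(0)
tokens = rng.standard_normal((197, 64))        # e.g. frozen ViT patch tokens (hypothetical size)
queries = 0.02 * rng.standard_normal((4, 64))  # the only trainable parameters in this sketch
feats = query_only_attention(queries, tokens)
print(feats.shape)  # (4, 64)
```

Because the backbone is frozen and only the query matrix is updated, the trainable parameter count scales with the number of queries rather than with the model size, which is the source of the memory savings the abstract claims.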