Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase on a set of base classes. Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC. Fine-tuning ViTs, however, is expensive in time, compute and storage. This has motivated the design of parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a fraction of the Transformer's parameters. While these methods have shown promise, inconsistencies in experimental conditions make it difficult to disentangle their advantage from other experimental factors, including the feature extractor architecture, the pre-trained initialization and the fine-tuning algorithm, amongst others. In this paper, we conduct a large-scale, experimentally consistent, empirical analysis to study PEFTs for few-shot image classification. Through a battery of over 1.8k controlled experiments on large-scale few-shot benchmarks including Meta-Dataset (MD) and ORBIT, we uncover novel insights that shed light on the efficacy of PEFTs in fine-tuning ViTs for few-shot classification. Our study yields two main findings: (i) fine-tuning just the LayerNorm parameters (which we call LN-Tune) during few-shot adaptation is an extremely strong baseline across ViTs pre-trained with both self-supervised and supervised objectives; (ii) for self-supervised ViTs, simply learning a set of scaling parameters for each attention matrix (which we call AttnScale) along with a domain-residual adapter (DRA) module leads to state-of-the-art performance on MD while being $\sim 9\times$ more parameter-efficient. Our extensive empirical findings set strong baselines and call for rethinking the current design of PEFT methods for FSC.
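To make the LN-Tune idea concrete, the following is a minimal sketch of the parameter selection it implies: mark only LayerNorm scale and bias parameters as trainable and freeze everything else. It is an illustration under stated assumptions, not the authors' implementation. The parameter names (`blocks.{i}.norm1`, `attn.qkv`, `mlp.fc1`, and so on), the helper functions, and the toy encoder dimensions are all hypothetical; the patch embedding, position embeddings and classification head are omitted.

```python
import re

def build_vit_param_shapes(dim=384, depth=12, mlp_ratio=4):
    """Return {name: shape} for a toy ViT encoder.

    Names and sizes are hypothetical placeholders, chosen only to
    resemble a typical Transformer encoder layout.
    """
    shapes = {}
    hidden = dim * mlp_ratio
    for i in range(depth):
        p = f"blocks.{i}."
        shapes[p + "norm1.weight"] = (dim,)
        shapes[p + "norm1.bias"] = (dim,)
        shapes[p + "attn.qkv.weight"] = (3 * dim, dim)
        shapes[p + "attn.qkv.bias"] = (3 * dim,)
        shapes[p + "attn.proj.weight"] = (dim, dim)
        shapes[p + "attn.proj.bias"] = (dim,)
        shapes[p + "norm2.weight"] = (dim,)
        shapes[p + "norm2.bias"] = (dim,)
        shapes[p + "mlp.fc1.weight"] = (hidden, dim)
        shapes[p + "mlp.fc1.bias"] = (hidden,)
        shapes[p + "mlp.fc2.weight"] = (dim, hidden)
        shapes[p + "mlp.fc2.bias"] = (dim,)
    # Final LayerNorm applied after the last block.
    shapes["norm.weight"] = (dim,)
    shapes["norm.bias"] = (dim,)
    return shapes

def numel(shape):
    """Number of scalar entries in a tensor of the given shape."""
    n = 1
    for s in shape:
        n *= s
    return n

# Matches "norm", "norm1", "norm2", ... as a full name component,
# followed by ".weight" or ".bias" (assumed naming convention).
LN_PATTERN = re.compile(r"(^|\.)norm\d*\.(weight|bias)$")

def select_ln_tune(shapes):
    """Map each parameter name to True (trainable) or False (frozen)."""
    return {name: bool(LN_PATTERN.search(name)) for name in shapes}

def count_params(shapes, trainable_mask):
    """Return (trainable_count, total_count) in scalar parameters."""
    total = sum(numel(s) for s in shapes.values())
    trainable = sum(numel(s) for n, s in shapes.items() if trainable_mask[n])
    return trainable, total

shapes = build_vit_param_shapes()
mask = select_ln_tune(shapes)
trainable, total = count_params(shapes, mask)
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.3f}%)")
```

In a real framework the same mask would typically be applied by toggling each parameter's gradient flag before building the optimizer. The sketch only shows how small the trainable set becomes when it is restricted to LayerNorm parameters in this toy configuration.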