Despite the growing demand for tuning foundation vision transformers (FViTs) on downstream tasks, fully unleashing FViTs' potential under data-limited scenarios (e.g., few-shot tuning) remains a challenge due to FViTs' data-hungry nature. Common data augmentation techniques fall short in this context because the few-shot tuning data contain only limited features. To tackle this challenge, we first identify an opportunity for FViTs in few-shot tuning: pretrained FViTs have already learned highly representative features from large-scale pretraining data, and these features are fully preserved during widely used parameter-efficient tuning. We thus hypothesize that leveraging those learned features to augment the tuning data can boost the effectiveness of few-shot FViT tuning. To this end, we propose a framework called Hint-based Data Augmentation (Hint-Aug), which aims to boost FViTs in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs. Specifically, Hint-Aug integrates two key enablers: (1) an Attentive Over-fitting Detector (AOD) that detects the patches on which FViTs are over-confident, so that over-fitting on the few-shot tuning data can potentially be alleviated, and (2) a Confusion-based Feature Infusion (CFI) module that infuses easy-to-confuse features from the pretrained FViTs into the over-confident patches detected by the AOD in order to enhance feature diversity during tuning. Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug's effectiveness: 0.04% to 32.91% higher accuracy than the state-of-the-art (SOTA) data augmentation method under various low-shot settings. For example, on the Pet dataset, Hint-Aug achieves 2.22% higher accuracy with 50% less training data than SOTA data augmentation methods.
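The two-stage idea described above (detect an over-confident patch, then infuse confusing features into only that patch) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the patch with the highest attention score serves as a proxy for the over-confident patch, and that the infusion is a small signed step along a precomputed per-patch direction (for instance, a gradient toward an easy-to-confuse class obtained from the pretrained FViT, which is not computed here). The function names, the `eps` step size, and the flat-list patch representation are all hypothetical.

```python
from typing import List, Sequence


def select_overconfident_patch(attn_scores: Sequence[float]) -> int:
    """Return the index of the patch with the highest attention score.

    Assumption: high attention is used here as a stand-in for
    "over-confident"; the abstract does not specify the criterion.
    """
    return max(range(len(attn_scores)), key=lambda i: attn_scores[i])


def _sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def infuse_patch(
    patches: Sequence[Sequence[float]],
    patch_idx: int,
    confusion_direction: Sequence[float],
    eps: float = 0.01,
) -> List[List[float]]:
    """Return a copy of `patches` with only patch `patch_idx` perturbed.

    Assumption: `confusion_direction` is a precomputed per-pixel direction
    (hypothetical) that moves the patch toward an easy-to-confuse class.
    The update is a signed step of size `eps`; other patches are untouched.
    """
    out = [list(p) for p in patches]
    out[patch_idx] = [
        x + eps * _sign(g) for x, g in zip(out[patch_idx], confusion_direction)
    ]
    return out


# Toy usage: three patches with two pixels each.
attn = [0.1, 0.6, 0.3]
patches = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
idx = select_overconfident_patch(attn)
augmented = infuse_patch(patches, idx, confusion_direction=[1.0, -2.0])
```

Restricting the perturbation to the selected patch leaves the rest of the sample unchanged, which matches the abstract's description of augmenting only the over-fitted parts of a tuning sample.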