Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building on this, recent studies such as CoOp and CoCoOp have proposed prompt learning, in which the context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the improvement on unseen classes remains marginal. Traditional zero-shot learning techniques have frequently used data augmentation to tackle this problem. Our experiments identify an important issue in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, which negatively impacts generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism, "Adding Attributes to Prompt Learning" (AAPL), we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. Experiments across 11 datasets show that, overall, AAPL performs favorably compared with existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
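To make the prompt-learning idea concrete, the following is a minimal sketch of how a prompt could be assembled from shared learnable context vectors plus a class-name embedding, and how an image feature could then be scored against each class. It is an illustration under stated assumptions, not the AAPL, CoOp, or CoCoOp implementation: `toy_text_encoder` is a hypothetical mean-pooling stand-in for a frozen text encoder, the one-hot class embeddings and the dimensions `DIM` and `M` are made up for the example, and no training step or adversarial token embedding is shown.

```python
import math
import random


def build_prompt(context, class_embedding):
    """Concatenate M shared context vectors with one class-name embedding."""
    return [list(v) for v in context] + [list(class_embedding)]


def toy_text_encoder(prompt):
    """Hypothetical stand-in for a frozen text encoder: mean-pool the tokens."""
    dim = len(prompt[0])
    return [sum(tok[i] for tok in prompt) / len(prompt) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def class_probabilities(image_feature, context, class_embeddings, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities, one per class."""
    logits = [
        cosine(image_feature, toy_text_encoder(build_prompt(context, c))) / temperature
        for c in class_embeddings
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative sizes only (assumptions, not values from the paper).
DIM, M, NUM_CLASSES = 8, 4, 3
rng = random.Random(0)

# The context vectors are the only parameters that prompt learning would train;
# here they are just given a small random initialization.
context = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(M)]

# Made-up one-hot "class-name embeddings" so the example is easy to follow.
class_embeddings = [
    [1.0 if i == k else 0.0 for i in range(DIM)] for k in range(NUM_CLASSES)
]

# Pretend the image encoder produced a feature aligned with class 1.
image_feature = list(class_embeddings[1])

probs = class_probabilities(image_feature, context, class_embeddings)
predicted = max(range(NUM_CLASSES), key=lambda k: probs[k])
```

In an actual prompt-learning setup, the context vectors would be updated by backpropagating a classification loss through the frozen encoders. The sketch only shows the forward pass, in which the same context is shared by every class prompt.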