Prompt engineering is a powerful tool for enhancing the performance of pre-trained models on downstream tasks. For example, the prompt "Let's think step by step" improved GPT-3's reasoning accuracy to 63% on MultiArith, while the prompt "a photo of" completed with a class name enables CLIP to achieve 80% zero-shot accuracy on ImageNet. Although previous research has explored prompt learning for the visual modality, analysis of what constitutes a good visual prompt specifically for image recognition remains limited. In addition, existing visual prompt tuning methods generalize worse than text-only prompt tuning. This paper explores our key insight: synthetic text images are good visual prompts for vision-language models! To this end, we propose LoGoPrompt, which reformulates the classification objective as visual prompt selection and thereby addresses a chicken-and-egg challenge: whether to first add synthetic text images as class-wise visual prompts or to first predict the class. Without any trainable visual prompt parameters, our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization in experiments on 16 datasets.
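To make the idea of classification as visual prompt selection concrete, the following is a minimal sketch under stated assumptions, not the LoGoPrompt implementation, whose details the abstract does not give. It assumes one plausible reading: because the correct class-wise prompt cannot be chosen before the class is known, every candidate prompt is tried and the best-scoring one is selected. The names `select_class_by_prompt`, `toy_add_prompt`, and `toy_score` are hypothetical; in practice the scoring function would be a vision-language model such as CLIP, and the prompt would be a rendered text image rather than a string.

```python
from typing import Callable, Sequence


def select_class_by_prompt(
    image: str,
    class_names: Sequence[str],
    add_prompt: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> str:
    """Try every class-wise visual prompt and return the best-scoring class.

    Assumption: prediction is done by selecting among prompted inputs,
    which sidesteps needing the class before choosing its prompt.
    """
    best_name, best_score = class_names[0], float("-inf")
    for name in class_names:
        prompted = add_prompt(image, name)  # image plus the prompt for `name`
        s = score(prompted, name)           # agreement between input and class
        if s > best_score:
            best_name, best_score = name, s
    return best_name


# Toy stand-ins (hypothetical): an "image" is just a string naming its content.
def toy_add_prompt(image: str, name: str) -> str:
    # Stands in for pasting a synthetic text image of `name` onto the image.
    return f"{image}|{name}"


def toy_score(prompted: str, name: str) -> float:
    # Stands in for a vision-language similarity score: high only when the
    # image content, the attached prompt, and the class name all agree.
    content, prompt = prompted.split("|")
    return 1.0 if content == prompt == name else 0.0


prediction = select_class_by_prompt(
    "cat", ["dog", "cat", "bird"], toy_add_prompt, toy_score
)
print(prediction)
```

In this toy setting only the prompt that matches the image content scores highly, so the selected prompt identifies the class. A real scoring function would give graded similarities rather than 0 or 1.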