For downstream applications of vision-language pre-trained models, there has been significant interest in constructing effective prompts. Existing approaches to prompt engineering either require laborious manual design or treat prompt tuning as a point estimation problem; both may fail to capture the diverse characteristics of categories, which limits their applicability. We introduce a Bayesian probabilistic approach to prompt tuning, in which label-specific stochastic prompts are generated hierarchically: a latent vector is first sampled from an underlying distribution and then passed through a lightweight generative model. Importantly, we semantically regularize the tuning process by minimizing the statistical distance between the visual patches and linguistic prompts, which pushes the stochastic label representations to faithfully capture diverse visual concepts rather than overfit to the training categories. We evaluate our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts. Extensive results on 15 datasets show promising transferability and generalization performance, both quantitatively and qualitatively.
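To make the two ingredients above concrete, the following is a minimal sketch rather than the model itself. The abstract does not specify the latent distribution, the generator architecture, or the statistical distance, so the sketch assumes a diagonal Gaussian latent, a single linear layer as the lightweight generator, and an entropy-regularized optimal transport (Sinkhorn) cost between patch and prompt-token embeddings. All names (`sample_stochastic_prompt`, `sinkhorn_distance`, the shapes, and the random stand-in features) are hypothetical.

```python
import numpy as np


def sample_stochastic_prompt(mu, log_sigma, W, b, n_tokens, rng):
    """Sample z ~ N(mu, sigma^2) and map it to prompt token embeddings.

    Assumption: the generator is a single linear layer; the abstract only
    says it is "lightweight".
    """
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(log_sigma) * eps          # reparameterized sample
    return (z @ W + b).reshape(n_tokens, -1)  # (n_tokens, dim)


def sinkhorn_distance(X, Y, reg=0.1, n_iters=200):
    """Entropy-regularized OT cost between two uniformly weighted sets.

    Assumption: cosine cost and Sinkhorn iterations stand in for the
    unspecified "statistical distance" between patches and prompts.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - Xn @ Yn.T                       # cosine cost, values in [0, 2]
    K = np.exp(-C / reg)                      # Gibbs kernel
    a = np.full(X.shape[0], 1.0 / X.shape[0])
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]        # transport plan
    return float((plan * C).sum()), plan


# Toy usage with random stand-ins for learned parameters and image features.
rng = np.random.default_rng(0)
latent_dim, n_tokens, dim, n_patches = 8, 4, 16, 9
mu, log_sigma = np.zeros(latent_dim), np.zeros(latent_dim)
W = 0.1 * rng.standard_normal((latent_dim, n_tokens * dim))
b = np.zeros(n_tokens * dim)
patches = rng.standard_normal((n_patches, dim))  # hypothetical patch features

prompt = sample_stochastic_prompt(mu, log_sigma, W, b, n_tokens, rng)
prompt_2 = sample_stochastic_prompt(mu, log_sigma, W, b, n_tokens, rng)
dist, plan = sinkhorn_distance(patches, prompt)
```

Each call to `sample_stochastic_prompt` yields a different prompt for the same label, which is what distinguishes this setup from a point estimate. In a training loop, a term like `dist` would be added to the classification loss as the regularizer, under the assumptions stated above.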