Contrastively trained text-image models have the remarkable ability to perform zero-shot classification, that is, classifying previously unseen images into categories that the model has never been explicitly trained to identify. However, these zero-shot classifiers need prompt engineering to achieve high accuracy. Prompt engineering typically requires hand-crafting a set of prompts for each individual downstream task. In this work, we aim to automate this prompt engineering and improve zero-shot accuracy through prompt ensembling. In particular, we ask: "Given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without needing access to labeled validation data?" We demonstrate that this is possible. In doing so, we identify several pathologies in a naive prompt scoring method, in which the score can easily be overconfident due to biases in the pre-training and test data, and we propose a novel prompt scoring method that corrects for these biases. When our proposed scoring method is used to build a weighted-average prompt ensemble, the resulting ensemble outperforms both the equal-average ensemble and hand-crafted prompts on ImageNet, 4 of its variants, and 11 fine-grained classification benchmarks, all while being fully automatic, optimization-free, and not requiring access to labeled validation data.
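To make the idea of a score-weighted prompt ensemble concrete, the following is a minimal sketch on toy 2-D embeddings. The abstract does not state the scoring formula, so `score_prompt` here is a hypothetical stand-in (the average gap between the top class logit and the mean class logit on unlabeled images), not the authors' bias-corrected score. The helper names, the softmax temperature, and the toy embeddings are likewise assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def score_prompt(image_embs, class_embs):
    # Hypothetical label-free score (an assumption, not the paper's formula):
    # mean over images of (max class logit - mean class logit).
    total = 0.0
    for img in image_embs:
        logits = [dot(img, c) for c in class_embs]
        total += max(logits) - sum(logits) / len(logits)
    return total / len(image_embs)

def softmax(scores, temperature=1.0):
    # Numerically stable softmax that turns prompt scores into ensemble weights.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ensemble(prompt_class_embs, weights):
    # prompt_class_embs[p][c] is the text embedding of class c under prompt p.
    # Returns one normalized, weight-averaged embedding per class.
    num_classes = len(prompt_class_embs[0])
    dim = len(prompt_class_embs[0][0])
    ensemble = []
    for c in range(num_classes):
        avg = [
            sum(w * embs[c][d] for w, embs in zip(weights, prompt_class_embs))
            for d in range(dim)
        ]
        ensemble.append(normalize(avg))
    return ensemble

def classify(image_emb, class_embs):
    # Zero-shot prediction: index of the class with the highest logit.
    logits = [dot(image_emb, c) for c in class_embs]
    return max(range(len(logits)), key=logits.__getitem__)

# Toy data: two unlabeled images, two classes, two prompts.
images = [normalize([0.9, 0.1]), normalize([0.2, 0.8])]
prompt_a = [[1.0, 0.0], [0.0, 1.0]]                    # separates the classes
prompt_b = [normalize([1.0, 1.0]), normalize([1.0, 1.0])]  # uninformative
pool = [prompt_a, prompt_b]

scores = [score_prompt(images, embs) for embs in pool]
weights = softmax(scores, temperature=0.1)
class_embs = weighted_ensemble(pool, weights)
predictions = [classify(img, class_embs) for img in images]
print(weights, predictions)
```

In this toy setting the uninformative prompt receives a score of zero and therefore a small weight, so the ensemble is dominated by the prompt that separates the classes. An equal-average ensemble corresponds to passing uniform `weights` to `weighted_ensemble`.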