Large-scale vision-language models, such as CLIP, learn powerful image-text representations that have found numerous applications, from zero-shot classification to text-to-image generation. Despite this, their ability to solve novel discriminative tasks via prompting falls behind that of large language models, such as GPT-3. Here we explore visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text. In particular, we discover an emergent ability of CLIP: simply drawing a red circle around an object directs the model's attention to that region while maintaining global information. We show the power of this simple approach by achieving state-of-the-art results in zero-shot referring expression comprehension and strong performance in keypoint localization tasks. Finally, we draw attention to some potential ethical concerns of large vision-language models.
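The red-circle prompt described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it uses Pillow for drawing, the line width and the choice of an ellipse inscribed in the candidate box are arbitrary, and `score_fn` is a hypothetical stand-in for an image-text similarity such as the one CLIP would compute between the annotated image and a referring expression.

```python
from typing import Callable, Sequence, Tuple

from PIL import Image, ImageDraw

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def draw_red_circle(image: Image.Image, box: Box, width: int = 3) -> Image.Image:
    """Return a copy of `image` with a red ellipse inscribed in `box`.

    The rest of the image is left untouched, so global context is preserved.
    The default line width is an assumption chosen for illustration.
    """
    marked = image.copy().convert("RGB")  # work on a copy, never the input
    ImageDraw.Draw(marked).ellipse(box, outline=(255, 0, 0), width=width)
    return marked


def pick_region(
    image: Image.Image,
    boxes: Sequence[Box],
    score_fn: Callable[[Image.Image], float],
) -> int:
    """Return the index of the candidate box whose circled image scores highest.

    `score_fn` is hypothetical: in practice it would wrap a vision-language
    model and return the similarity between the annotated image and the text.
    """
    scores = [score_fn(draw_red_circle(image, box)) for box in boxes]
    return max(range(len(scores)), key=scores.__getitem__)
```

In use, one annotated copy of the image is produced per candidate region, each copy is scored against the same text, and the highest-scoring region is returned. Where the candidate boxes come from (for example, an object proposal method) is left open here.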