Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models struggle with the fundamental computer vision task of object localisation, because they are trained on multimodal data consisting mostly of captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialised and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt containing a minimal set of parameters that is slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data, without requiring new output heads. Our experiments demonstrate strong zero-shot localisation performance on a variety of datasets, including Pascal VOC, COCO, and LVIS, as well as diverse images such as paintings and cartoons.
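To make the core idea concrete, the following is a minimal sketch of how such a positional insert could be realised, assuming a frozen VLM that exposes a vision encoder and a language model; the module and interface names (`encode_image`, `language_model`) are hypothetical and for illustration only, not the authors' implementation.

```python
# Minimal sketch: a learnable, input-agnostic positional tensor is added to the
# frozen vision encoder's patch features before they enter the frozen LLM; only
# this tensor is optimised, using the standard next-token prediction loss.
import torch
import torch.nn as nn


class PositionalInsert(nn.Module):
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        # The only trainable parameters: one embedding per visual token.
        self.pin = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim) from a frozen vision encoder.
        return visual_tokens + self.pin


def training_step(frozen_vlm, pin_module, images, target_token_ids):
    """One optimisation step in which everything except the PIN is frozen.

    `frozen_vlm.encode_image` and `frozen_vlm.language_model` are assumed
    interfaces used here purely for illustration.
    """
    with torch.no_grad():
        visual_tokens = frozen_vlm.encode_image(images)   # frozen vision encoder
    visual_tokens = pin_module(visual_tokens)             # trainable insert
    logits = frozen_vlm.language_model(visual_tokens, target_token_ids)
    # Next-token prediction on a target sequence that spells out the object
    # location, e.g. coordinate tokens for a bounding box.
    loss = nn.functional.cross_entropy(
        logits[:, :-1].flatten(0, 1), target_token_ids[:, 1:].flatten()
    )
    loss.backward()
    return loss
```

Because only `pin_module.parameters()` are passed to the optimiser, the approach keeps the caption-based VLM intact and adds no new output heads, consistent with the frozen-weights constraint described above.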