Large language models have achieved great success in recent years, as have their vision variants. Existing vision-language models can describe images in natural language, answer visual questions, or perform complex reasoning about an image. However, it remains unclear how localization tasks, such as word grounding or referring localization, can be performed with large language models. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, generating captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narratives dataset, which contains pixel-word-aligned captions derived from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM
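To make the idea of regressing pixel coordinates for each output word concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the head architecture, the loss, or the feature source, so everything here is an assumption: `make_head`, `regress_word_locations`, and `l1_loss` are hypothetical names, the head is a single linear layer with a sigmoid, and the per-token features are random stand-ins for language-model hidden states.

```python
import math
import random


def make_head(feature_dim, seed=0):
    """Create a randomly initialised linear layer with two outputs (x and y logits).

    Hypothetical stand-in for a learned localization head.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(feature_dim)] for _ in range(2)]
    bias = [0.0, 0.0]
    return weights, bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def regress_word_locations(token_features, head, image_size):
    """Return one (x, y) pixel coordinate per token feature vector."""
    weights, bias = head
    width, height = image_size
    points = []
    for feat in token_features:
        # Linear projection of the token feature to two logits.
        logits = [sum(w * f for w, f in zip(row, feat)) + b
                  for row, b in zip(weights, bias)]
        # Squash to [0, 1], then scale to pixel coordinates inside the image.
        x = sigmoid(logits[0]) * (width - 1)
        y = sigmoid(logits[1]) * (height - 1)
        points.append((x, y))
    return points


def l1_loss(pred, target):
    """Mean absolute error between predicted and annotated points (assumed loss)."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / (2 * len(pred))


# Usage: three caption words, each with an 8-dimensional placeholder feature.
words = ["a", "dog", "running"]
rng = random.Random(1)
features = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in words]
head = make_head(feature_dim=8)
locations = regress_word_locations(features, head, image_size=(640, 480))
for word, (x, y) in zip(words, locations):
    print(f"{word}: ({x:.1f}, {y:.1f})")
```

In this sketch every generated word receives exactly one point, which is what makes the grounding dense. With pixel-word-aligned supervision of the kind the abstract describes, a loss such as `l1_loss` could compare each predicted point with the annotated location for the same word.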