Compared with Large Language Models (LLMs), Large Vision-Language Models (LVLMs) can additionally accept images as input, showcasing more interesting emergent capabilities and demonstrating impressive performance on various vision-language tasks. Motivated by text prompting in LLMs, visual prompting has been explored to enhance LVLMs' capabilities of perceiving visual information. However, previous visual prompting techniques solely process visual inputs without considering text queries, limiting the models' ability to follow text instructions to complete tasks. To fill this gap, in this work, we propose a new prompting technique named Attention Prompting on Image, which simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLMs on various tasks. Specifically, we generate a text-query-dependent attention heatmap for the input image with an auxiliary model such as CLIP. The heatmap is then multiplied element-wise with the pixel values of the original image to obtain the actual input image for the LVLM. Extensive experiments on various vision-language benchmarks verify the effectiveness of our technique. For example, Attention Prompting on Image improves LLaVA-1.5 by 3.8% and 2.9% on the MM-Vet and LLaVA-Wild benchmarks, respectively.