Large Multi-modal Models (LMMs) have recently demonstrated the ability to understand the visual content of images when given instructions about them. Because they are built upon Large Language Models (LLMs), LMMs also inherit LLM abilities and characteristics such as in-context learning, where a coherent sequence of images and texts is given as the input prompt. However, we identify a new limitation of off-the-shelf LMMs: a small fraction of incoherent images or text descriptions misleads them into generating output biased toward the hijacked context rather than the originally intended one. To address this, we propose a pre-filtering method that removes irrelevant contexts via GPT-4V, exploiting its robustness to distribution shift within the contexts. We further investigate whether replacing the hijacked visual and textual contexts with correlated ones, via GPT-4V and text-to-image models, can help yield coherent responses.
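The pre-filtering idea can be illustrated with a minimal sketch. It is not the actual pipeline: the relevance judge below is a toy word-overlap heuristic standing in for GPT-4V, and every name in it (`ContextPair`, `prefilter_contexts`, `toy_judge`, `STOPWORDS`) is hypothetical and chosen only for illustration. The sketch assumes each in-context example is an image paired with a text description, and that a context is dropped when the judge deems it incoherent with the rest.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ContextPair:
    """One in-context example: an image reference plus its text description."""
    image_id: str  # placeholder for an actual image (hypothetical field)
    caption: str


# Small stopword list so that function words do not count as topical overlap.
STOPWORDS = {"a", "the", "on", "with"}


def content_words(caption: str) -> set:
    """Lowercased caption words with stopwords removed."""
    return {w for w in caption.lower().split() if w not in STOPWORDS}


def toy_judge(candidate: ContextPair, all_contexts: List[ContextPair]) -> bool:
    """Stand-in for a GPT-4V relevance check (an assumption, not the real judge).

    Keeps a context if it shares a content word with a strict majority of
    the other contexts.
    """
    others = [c for c in all_contexts if c is not candidate]
    if not others:
        return True
    words = content_words(candidate.caption)
    overlaps = sum(1 for c in others if words & content_words(c.caption))
    return overlaps * 2 > len(others)


def prefilter_contexts(
    contexts: List[ContextPair],
    is_relevant: Callable[[ContextPair, List[ContextPair]], bool],
) -> List[ContextPair]:
    """Remove contexts the judge considers irrelevant, preserving order."""
    return [c for c in contexts if is_relevant(c, contexts)]


contexts = [
    ContextPair("img1", "a dog runs on the beach"),
    ContextPair("img2", "a dog plays with a ball"),
    ContextPair("img3", "the dog sleeps on a sofa"),
    ContextPair("img4", "stock prices fell sharply today"),  # hijacking context
]

kept = prefilter_contexts(contexts, toy_judge)
print([c.image_id for c in kept])
```

In practice the `is_relevant` callable would wrap a call to a vision-language model; passing the judge as a parameter keeps the filtering logic independent of which model performs the check.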