We introduce VISUAL EMBEDDED INSTRUCTION (VIM), a new framework designed to evaluate the visual instruction-following capability of Multimodal Large Language Models (MLLMs). As illustrated in Figure 2, VIM challenges MLLMs by embedding the instructions into the visual scenes, which demands strong visual interpretive skills for instruction following. We adapt VIM to various benchmarks, including VQAv2, MME, MM-Vet, and the RefCOCO series, to compose a VIM bench, and probe diverse MLLMs under three distinct in-context learning settings: Zero Shot, One Shot, and Pair Shot. We observe a significant performance disparity between open-source MLLMs and GPT-4V, implying that the open-source models' proficiency in visual instruction comprehension is not up to par. Our results highlight a promising direction for enhancing MLLMs' instruction-following capabilities. We intend VIM to serve as a useful standard for advancing the state of the art and driving further progress in the field.