Instruction following is crucial in contemporary LLMs. However, when extended to multimodal settings, it often suffers from misalignment between a specific textual instruction and the targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as large multimodal models (LMMs) and robot models. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create the IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM: as a plug-and-play tool, it significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code is available at https://github.com/2toinf/IVM.
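To make the two ideas above concrete, the following is a minimal toy sketch, not the released IVM implementation. It assumes a grayscale image stored as nested lists, a soft mask in [0, 1] where low values mark instruction-irrelevant pixels, and a per-sample weight standing in for a discriminator score. The names `apply_mask` and `weighted_loss` are hypothetical, and the weighting shown is a plain weighted mean, which may differ from the actual DWSL objective.

```python
from typing import List, Sequence

Image = List[List[float]]  # toy grayscale H x W format (an assumption)


def apply_mask(image: Image, mask: Image, fill: float = 0.0) -> Image:
    """Blend each pixel toward `fill` where the mask is low.

    Mask values lie in [0, 1]: 1 keeps the pixel, 0 replaces it with `fill`.
    """
    same_shape = len(image) == len(mask) and all(
        len(img_row) == len(mask_row) for img_row, mask_row in zip(image, mask)
    )
    if not same_shape:
        raise ValueError("image and mask must have the same shape")
    return [
        [m * p + (1.0 - m) * fill for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def weighted_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-sample losses.

    `weights` stands in for discriminator scores (assumed non-negative), so
    samples judged higher quality contribute more to the training signal.
    """
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * l for w, l in zip(weights, losses)) / total


if __name__ == "__main__":
    image = [[1.0, 2.0], [3.0, 4.0]]
    mask = [[1.0, 0.0], [0.0, 1.0]]  # keep only the diagonal pixels
    print(apply_mask(image, mask))
    print(weighted_loss([1.0, 3.0], [3.0, 1.0]))
```

In this sketch the masked image would be passed to the downstream multimodal model in place of the original, which is what makes the approach plug-and-play: the downstream model itself is unchanged.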