Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial because it focuses on the specific elements we are most interested in. Encouragingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens general image and video comprehension.