Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks; their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with newly emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks while maintaining performance on single-image tasks. Moreover, our model exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT
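To make the "interleaved data format as a general template" concrete, below is a minimal sketch of what one interleaved training sample could look like. It follows the common LLaVA data convention (a "conversations" list with <image> placeholder tokens); the field names and file names here are illustrative assumptions, not the authoritative M4-Instruct schema, which is documented in the repository.

```python
# Hypothetical interleaved sample, assuming a LLaVA-style JSON convention.
# Each <image> token in the conversation text corresponds, in order, to one
# entry in the "images" list; exact M4-Instruct field names may differ.
sample = {
    "id": "multi_image_example_0",
    "images": ["frame_00.jpg", "frame_01.jpg"],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n<image>\nWhat changed between these two frames?",
        },
        {
            "from": "gpt",
            "value": "A red car enters the scene in the second frame.",
        },
    ],
}
```

Under this template, video frames, 3D views, and patches of a single high-resolution image can all be serialized as an ordered sequence of <image> tokens, which is why one format can cover all four M4 settings.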