Large multimodal models (LMMs) have shown great results on single-image vision-language tasks. However, their ability to solve multi-image vision-language tasks remains limited. Existing LMMs like OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. To this end, we meticulously construct Mantis-Instruct, a dataset of 721K multi-image instruction examples, and use it to train a family of Mantis models. Instruction tuning equips Mantis with multi-image skills such as co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 achieves SoTA results on all the multi-image benchmarks and beats the strongest multi-image baseline, Idefics2-8B, by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image examples, roughly 200x more than Mantis-Instruct. We observe that Mantis performs equally well on held-in and held-out benchmarks, demonstrating its generalization ability. We further evaluate Mantis on single-image benchmarks and show that it also maintains strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training; instead, they can be acquired through low-cost instruction tuning. The training and evaluation of Mantis paves the road for future work on improving LMMs' multi-image abilities.