MANTIS: Interleaved Multi-Image Instruction Tuning

Recent years have witnessed a great array of large multimodal models (LMMs) that effectively solve single-image vision-language tasks. However, their ability to solve multi-image vision-language tasks is yet to be improved. Existing multi-image LMMs (e.g., OpenFlamingo, Emu, Idefics) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. To this end, we meticulously construct Mantis-Instruct, which contains 721K instances from 14 multi-image datasets. We design Mantis-Instruct to cover different multi-image skills such as co-reference, reasoning, comparison, and temporal understanding. We combine Mantis-Instruct with several single-image vision-language datasets to train our model, Mantis, to handle any interleaved image-text input. We evaluate the trained Mantis on five multi-image benchmarks and eight single-image benchmarks. Though it requires only academic-level resources (i.e., 36 hours on 16xA100-40G), Mantis-8B achieves state-of-the-art performance on all the multi-image benchmarks and beats the existing best multi-image LMM, Idefics2-8B, by an average of 9 absolute points. We observe that Mantis performs equally well on the held-in and held-out evaluation benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that it maintains strong single-image performance on par with CogVLM and Emu2. Our results are particularly encouraging, as they show that low-cost instruction tuning is indeed much more effective than intensive pre-training for building multi-image LMMs.
