MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Generating natural and meaningful responses to multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn, single-image input, they fall short in real-world conversation scenarios, such as following instructions over a long context history with multiple turns and multiple images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction-tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn, multi-image conversations. We employ a clustering algorithm to find relevant images and textual descriptions from open-source Wikipedia, and human annotators construct the question-answer pairs with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind their closed-source counterparts due to limited conversational instruction-tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly narrows this gap, generating longer and more accurate conversations and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLMs and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
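The abstract does not say which clustering algorithm or embedding model is used to group related Wikipedia entries, so the following is only a minimal sketch of the general idea, assuming each entry has already been mapped to an embedding vector. The function names (`cosine`, `group_entries`), the similarity threshold, and the toy three-dimensional vectors are all hypothetical and chosen purely for illustration; greedy centroid clustering is a stand-in technique, not necessarily the authors' method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_entries(embeddings, threshold=0.8):
    """Greedy clustering (illustrative stand-in, not the paper's algorithm).

    Each entry joins the first cluster whose centroid is at least
    `threshold` similar to it; otherwise it opens a new cluster.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for name, vec in embeddings.items():
        for cluster in clusters:
            if cosine(vec, cluster["centroid"]) >= threshold:
                cluster["members"].append(name)
                n = len(cluster["members"])
                # Update the running mean of the cluster's member vectors.
                cluster["centroid"] = [
                    (c * (n - 1) + v) / n
                    for c, v in zip(cluster["centroid"], vec)
                ]
                break
        else:
            clusters.append({"centroid": list(vec), "members": [name]})
    return [cluster["members"] for cluster in clusters]


# Toy embeddings (hypothetical values): two bridges and two big cats.
toy = {
    "Golden Gate Bridge": [0.9, 0.1, 0.0],
    "Brooklyn Bridge": [0.8, 0.2, 0.1],
    "Bengal tiger": [0.0, 0.1, 0.9],
    "Snow leopard": [0.1, 0.0, 0.95],
}
groups = group_entries(toy)
print(groups)
```

Under these assumptions, entries that land in the same group would supply the images and descriptions for one multi-image dialogue, which annotators could then turn into question-answer pairs.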
