Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, has become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code at https://mmie-bench.github.io/.