Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks remains unclear because of the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 22 models, reveal that while the closed-source GPT-4o outperforms the others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen as the number of images increases. We strongly encourage further research on enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.