Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks remains unclear due to the limited scope of existing benchmarks. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, reveal that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen as the number of images increases. We strongly encourage intensified research efforts toward enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.