Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, remains less investigated. To address this gap, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences of varying lengths. We also employ a GPT-4-assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe the dynamic information in given image sequences, often hallucinating or misrepresenting objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.