Multimodal large language models (MLLMs) have shown great potential in perception and interpretation tasks, but their capabilities in predictive reasoning remain under-explored. To address this gap, we introduce a novel benchmark that assesses the predictive reasoning capabilities of MLLMs across diverse scenarios. Our benchmark targets three important domains: abstract pattern reasoning, human activity prediction, and physical interaction prediction. We further develop three evaluation methods, each powered by a large language model, to robustly quantify a model's performance in predicting and reasoning about the future based on multi-visual context. Rigorous experiments confirm the soundness of the proposed benchmark and evaluation methods and reveal the strengths and weaknesses of current popular MLLMs in predictive reasoning. Lastly, our proposed benchmark provides a standardized evaluation framework for MLLMs and can facilitate the development of more advanced models that can reason and predict over complex, long sequences of multimodal input.