Humans possess the remarkable ability to foresee the future to a certain extent based on present observations, a skill we term foresight minds. However, this capability remains largely underexplored within existing Multimodal Large Language Models (MLLMs), hindering their capacity to learn the fundamental principles of how things operate and the intentions behind observed subjects. To address this issue, we introduce the integration of future modeling into the existing learning frameworks of MLLMs. By utilizing the subject trajectory, a highly structured representation of a consecutive frame sequence, as a learning objective, we aim to bridge the gap between the past and the future. We propose two innovative methods to empower MLLMs with foresight minds, Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT), which are inspired by the modern learning paradigm of LLMs. Specifically, FPT jointly trains various tasks centered on trajectories, enabling MLLMs to learn how to attend to and predict entire trajectories from a given initial observation. Then, FIT requires MLLMs to first predict the trajectories of related objects and then reason about potential future events based on them. Aided by FPT and FIT, we build a novel and unified MLLM named Merlin that supports multi-image inputs and analysis of the potential actions of multiple objects for future reasoning. Experimental results show that Merlin exhibits powerful foresight minds, with impressive performance on both future reasoning and visual comprehension tasks.