Multimodal large language models (MLLMs) are increasingly regarded as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model must predict the final scene after all actions have been executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs a long action sequence with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, supporting fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap relative to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences improves performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning in egocentric embodied perception.
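To make the task setup and the stepwise-reasoning variant concrete, the sketch below shows one plausible shape for a benchmark instance and for chunked test-time prediction. The paper's actual data format and evaluation API are not given here; every field name, the `predict_scene` stub, and the conditioning scheme are illustrative assumptions, not EXPLORE-Bench's real interface.

```python
from dataclasses import dataclass

# Hypothetical schema for one EXPLORE-Bench instance; all names below
# are illustrative assumptions, not the benchmark's actual format.

@dataclass
class Relation:
    subject: str      # e.g. "mug"
    predicate: str    # e.g. "in"
    obj: str          # e.g. "sink"

@dataclass
class SceneAnnotation:
    objects: list[str]                # object categories
    attributes: dict[str, list[str]]  # object -> visual attributes
    relations: list[Relation]         # inter-object relations

@dataclass
class ExploreBenchInstance:
    initial_image: str        # path to a first-person frame
    actions: list[str]        # atomic action descriptions
    final_scene: SceneAnnotation  # ground-truth final state

def predict_scene(image: str, actions: list[str]) -> SceneAnnotation:
    """Stand-in for an MLLM call that predicts the scene after `actions`.
    A real system would prompt a multimodal model here; this stub just
    returns an empty annotation so the sketch runs end to end."""
    return SceneAnnotation(objects=[], attributes={}, relations=[])

def stepwise_predict(image: str, actions: list[str],
                     chunk: int = 1) -> SceneAnnotation:
    """Test-time scaling via stepwise reasoning: split the long action
    sequence into chunks and roll the predicted state forward, at the
    cost of one model call per chunk instead of a single call."""
    state = SceneAnnotation(objects=[], attributes={}, relations=[])
    for i in range(0, len(actions), chunk):
        # Each call conditions on the initial image and the next chunk;
        # how the running state is fed back is itself an assumption.
        state = predict_scene(image, actions[i:i + chunk])
    return state

if __name__ == "__main__":
    instance = ExploreBenchInstance(
        initial_image="kitchen_ego_frame.jpg",  # illustrative path
        actions=["pick up the mug", "place it in the sink"],
        final_scene=SceneAnnotation(
            objects=["mug", "sink"],
            attributes={"mug": ["white", "ceramic"]},
            relations=[Relation("mug", "in", "sink")],
        ),
    )
    print(stepwise_predict(instance.initial_image, instance.actions))
```

The trade-off the abstract reports falls directly out of this structure: with chunk size 1, an instance with N atomic actions costs N model calls rather than one, which is the non-trivial overhead that accompanies the accuracy gain from decomposition.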