AI agents need to plan in order to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO), a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on foundation models' understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs. To address this gap, we introduce MATEO (MultimodAl Temporal Execution Order), a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) that real-world planning requires. We acquire a high-quality professional multimodal recipe corpus, authored through a standardized editorial process that decomposes instructions into discrete steps, each paired with corresponding images. We collect TEO annotations as graphs by designing and deploying a scalable crowdsourcing pipeline. Using MATEO, we evaluate six state-of-the-art LVLMs across model scales, varying the language context, multimodal input structure, and fine-tuning strategies.
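To make the TEO notion concrete, a minimal sketch (illustrative only; the step names and edges below are hypothetical and not drawn from the MATEO corpus) represents the TEO as a set of directed precedence edges over recipe steps and checks whether a proposed linear execution order respects every precondition:

```python
def is_valid_execution(edges, order):
    """Return True if `order` respects every precedence edge (a, b),
    where (a, b) means step a must finish before step b starts."""
    position = {step: i for i, step in enumerate(order)}
    return all(position[a] < position[b] for a, b in edges)

# Toy recipe: 1 "boil water", 2 "chop vegetables", 3 "add pasta", 4 "serve".
teo_edges = [(1, 3), (2, 4), (3, 4)]

print(is_valid_execution(teo_edges, [2, 1, 3, 4]))  # True: chopping may precede boiling
print(is_valid_execution(teo_edges, [3, 1, 2, 4]))  # False: pasta added before water boils
```

Because the TEO is a DAG rather than a chain, several linearizations can be valid at once, which is exactly what a linear-chain approximation of the TEO fails to capture.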