Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales remains difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimensions: the temporal scale, which forecasts the states of humans and surgery at varying look-ahead intervals, and the state scale, which models a hierarchy of states in each scene type. For example, in general scenes, contact-relationship states are finer-grained than spatial-relationship states; in surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose Incremental Generation and Multi-agent Collaboration (IG-MC), a method that integrates two key innovations. First, a plug-and-play incremental generation module continuously synthesizes up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, a decision-driven multi-agent collaboration framework for multi-state prediction comprises generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles, balancing global coherence with local fidelity.
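The decision-driven prediction cycle described above can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation: all class names (`GenerationAgent`, `InitiationAgent`, `StateAssessmentAgent`), the `Preview` container, and the concrete horizons and state scales are hypothetical stand-ins for the generation, initiation, and multi-state assessment agents named in the abstract.

```python
from dataclasses import dataclass

# Hypothetical sketch of an IG-MC-style prediction cycle.
# All names and signatures are illustrative, not the authors' API.

@dataclass
class Preview:
    horizon: int   # look-ahead interval (temporal scale)
    frame: str     # stand-in for a generated visual preview

class GenerationAgent:
    """Incrementally synthesizes visual previews at expanding horizons,
    conditioning each preview on the previous one."""
    def generate(self, last: "Preview | None", horizon: int) -> Preview:
        base = last.frame if last else "t0"
        return Preview(horizon, f"{base}->t+{horizon}")

class InitiationAgent:
    """Decides whether a new prediction cycle should be triggered."""
    def should_trigger(self, preview: Preview) -> bool:
        return True  # trivially always triggers in this sketch

class StateAssessmentAgent:
    """Predicts one state scale (e.g. phase vs. step) from a preview."""
    def __init__(self, scale: str):
        self.scale = scale
    def predict(self, preview: Preview) -> tuple:
        return (self.scale, f"{self.scale}@{preview.frame}")

def mstp_cycle(horizons, state_scales):
    gen, init = GenerationAgent(), InitiationAgent()
    assessors = [StateAssessmentAgent(s) for s in state_scales]
    preview, predictions = None, []
    for h in horizons:                      # expanding temporal scales
        preview = gen.generate(preview, h)  # incremental generation
        if init.should_trigger(preview):    # decision-driven triggering
            predictions.extend(a.predict(preview) for a in assessors)
    return predictions

# One cycle over three look-ahead intervals and two state scales.
preds = mstp_cycle([5, 15, 30], ["phase", "step"])
```

The key design point the sketch mirrors is that each preview is built incrementally from the previous one rather than regenerated from scratch, which is what keeps decisions and generated visuals synchronized as the look-ahead interval grows.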