Video-based representations have gained prominence in planning and decision-making because they encode rich spatiotemporal dynamics and geometric relationships. These representations enable flexible and generalizable solutions to complex tasks such as object manipulation and navigation. However, existing video planning frameworks often struggle to adapt to failures at interaction time because they cannot reason about uncertainty in partially observed environments. To overcome these limitations, we introduce a novel framework that integrates interaction-time data into the planning process. Our approach updates model parameters online and filters out previously failed plans during generation. This enables implicit state estimation, allowing the system to adapt dynamically without explicitly modeling unknown state variables. We evaluate our framework through extensive experiments on a new simulated manipulation benchmark, demonstrating that it improves replanning performance and advances video-based decision-making.
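The replanning loop described above can be sketched in miniature. The snippet below is a hypothetical toy illustration, not the paper's actual model: `ToyVideoPlanner`, its action set, and the per-action success counts standing in for "online parameter updates" are all invented for exposition. It shows the two mechanisms the abstract names: recording failed plans so generation filters them out, and updating parameters from interaction-time outcomes.

```python
import random

class ToyVideoPlanner:
    """Hypothetical stand-in for a learned video planning model."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.failed_plans = set()  # plans observed to fail at interaction time
        self.bias = {}             # crude "online parameters": per-action success counts

    def propose(self, actions=("left", "right", "push", "pull"), k=8):
        # Sample candidate plans, weighting actions by observed success.
        candidates = [
            tuple(self.rng.choices(
                actions,
                weights=[1 + self.bias.get(a, 0) for a in actions],
                k=3))
            for _ in range(k)
        ]
        # Filter out plans that already failed during interaction.
        return [p for p in candidates if p not in self.failed_plans]

    def report_failure(self, plan):
        # Interaction-time feedback: remember the failed plan so that
        # generation excludes it in subsequent replanning rounds.
        self.failed_plans.add(plan)

    def report_success(self, plan):
        # Online parameter update: reinforce actions from a successful plan.
        for a in plan:
            self.bias[a] = self.bias.get(a, 0) + 1
```

In this sketch, "implicit state estimation" corresponds to the planner never representing the unknown state directly; the growing set of failed plans and the shifted sampling weights together steer generation away from hypotheses inconsistent with what interaction has revealed.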