Video-based representations have gained prominence in planning and decision-making due to their ability to encode rich spatiotemporal dynamics and geometric relationships. These representations enable flexible and generalizable solutions for complex tasks such as object manipulation and navigation. However, existing video planning frameworks often fail to adapt when plans break down at interaction time, because they cannot reason about uncertainty in partially observed environments. To overcome this limitation, we introduce a framework that integrates interaction-time data into the planning process: model parameters are updated online, and previously failed plans are filtered out during generation. This yields an implicit form of state estimation, allowing the system to adapt dynamically without explicitly modeling unknown state variables. We evaluate the framework through extensive experiments on a new simulated manipulation benchmark and show that it improves replanning performance under partial observability.
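The abstract names two mechanisms, online parameter updates from interaction-time data and rejection of previously failed plans during generation. The sketch below shows one way such a replanning loop could be wired together; it is a minimal illustration under assumed interfaces, not the paper's implementation. All names here (`VideoPlanner`, `DummyEnv`, `replan_loop`, `reject_radius`, and the placeholder training objective) are hypothetical stand-ins.

```python
# Hypothetical sketch of an interaction-time replanning loop: filter candidate
# plans against past failures, execute, and fine-tune the planner online on the
# resulting interaction data. Not the paper's actual API or objective.
import torch
import torch.nn as nn


class VideoPlanner(nn.Module):
    """Stand-in for a generative video planner that proposes candidate plans."""

    def __init__(self, plan_dim: int = 16):
        super().__init__()
        self.net = nn.Linear(plan_dim, plan_dim)

    def propose(self, obs: torch.Tensor, n_candidates: int = 8) -> torch.Tensor:
        # Sample several candidate plans conditioned on the current observation.
        noise = torch.randn(n_candidates, obs.shape[-1])
        return self.net(obs + noise)


class DummyEnv:
    """Toy environment: execution succeeds once the proposed plan is 'small'."""

    def reset(self) -> torch.Tensor:
        return torch.zeros(16)

    def step(self, plan: torch.Tensor):
        success = bool(plan.norm() < 1.0)
        return torch.randn(16), success


def distance_to_failures(plan: torch.Tensor, failed_plans) -> torch.Tensor:
    # Minimum L2 distance to any previously failed plan.
    if not failed_plans:
        return torch.tensor(float("inf"))
    return min(torch.norm(plan - f) for f in failed_plans)


def replan_loop(planner, env, max_attempts=5, reject_radius=1.0, lr=1e-3) -> bool:
    optimizer = torch.optim.Adam(planner.parameters(), lr=lr)
    failed_plans = []
    obs = env.reset()
    for _ in range(max_attempts):
        # 1) Generate candidates and reject those too close to past failures.
        candidates = planner.propose(obs)
        viable = [c for c in candidates
                  if distance_to_failures(c, failed_plans) > reject_radius]
        plan = (viable or list(candidates))[0]

        # 2) Execute the selected plan; stop on success.
        obs, success = env.step(plan)
        if success:
            return True

        # 3) Record the failure and take a gradient step on the newly observed
        #    interaction data (placeholder objective standing in for whatever
        #    fine-tuning loss the actual method uses).
        failed_plans.append(plan.detach())
        loss = planner.net(obs).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return False


if __name__ == "__main__":
    torch.manual_seed(0)
    print("success:", replan_loop(VideoPlanner(), DummyEnv()))
```

In this toy loop the rejection radius and the online loss are the only two knobs; in the method described above these roles would presumably be played by the video model's own sampling procedure and its fine-tuning objective on interaction-time observations.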