Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e., potential interaction areas, sharpening the conditional policy's focus on the interactive areas critical for task execution. The framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamics modeling. By exploiting motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotation of active regions. We validate ARDuP's efficacy through extensive experiments on the CLIPort simulation benchmark and the real-world dataset BridgeData v2, achieving notable improvements in success rate and generating convincingly realistic video plans.