Reinforcement Learning (RL) has achieved remarkable success across many domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions is challenging, and the resulting rewards may not generalize well across tasks. To address this limitation, we leverage the rich world knowledge embedded in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc reward design. Our key idea is to exploit off-the-shelf video diffusion models, pretrained on large-scale video datasets, as informative reward functions at both the video level and the frame level. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of the agent's trajectories and the generated goal videos. To enable finer-grained goal achievement, we derive a frame-level goal by using CLIP to identify the most relevant frame of the generated video, which serves as the goal state. We then employ a learned forward-backward representation, which models the probability of visiting the goal state from a given state-action pair, as the frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.
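The two reward signals described above can be sketched as embedding similarities: a video-level reward comparing a trajectory embedding against a goal-video embedding, and a CLIP-style selection of the goal frame from the generated video. This is a minimal illustration only; the function names, the use of plain cosine similarity, and the embedding shapes are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def video_level_reward(traj_emb: np.ndarray, goal_emb: np.ndarray) -> float:
    """Cosine similarity between the latent embedding of an agent
    trajectory and that of a generated goal video (illustrative)."""
    traj = traj_emb / np.linalg.norm(traj_emb)
    goal = goal_emb / np.linalg.norm(goal_emb)
    return float(traj @ goal)

def select_goal_frame(frame_embs: np.ndarray, task_emb: np.ndarray) -> int:
    """Pick the index of the generated frame whose CLIP image embedding
    best matches the CLIP text embedding of the task description.

    frame_embs: (T, d) per-frame image embeddings
    task_emb:   (d,)   task text embedding
    """
    sims = frame_embs @ task_emb / (
        np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(task_emb))
    return int(np.argmax(sims))
```

A frame selected this way would serve as the goal state for the frame-level reward, which in the paper is computed from a learned forward-backward representation rather than raw similarity.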