Reward design is critical for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity scores produced by vision-language models (VLMs) to augment task rewards with visual feedback. A common practice is to add VLM scores linearly to task or success rewards without explicit shaping, which can alter the optimal policy. Moreover, because such approaches often rely on single static images, they struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually distinct states. Furthermore, a single viewpoint can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the task relevance of states using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function, mitigating the bias toward specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards with VLM-based guidance, automatically reducing the influence of the VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, and verify our design choices through ablation studies.
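As an illustrative sketch only (the abstract does not give the exact formulation, which appears in the paper body), a state-dependent shaping term of the kind described might weight the VLM-based guidance so that it fades as the learned state relevance saturates; every name below (`shaped_reward`, `relevance`, `beta`) is hypothetical and not taken from the paper.

```python
def shaped_reward(task_reward: float, relevance: float, beta: float = 1.0) -> float:
    """Hypothetical state-dependent shaping, assuming `relevance` is a
    normalized score in [0, 1] derived from the video-text similarity of a
    frozen pre-trained VLM. The guidance weight vanishes as relevance
    approaches 1, so the task reward dominates once the desired motion
    pattern is achieved."""
    guidance_weight = beta * (1.0 - relevance)  # decays as the motion is mastered
    return task_reward + guidance_weight * relevance

# Early in training, low relevance leaves room for VLM guidance;
# near convergence, the shaping term is negligible and only the task reward remains.
print(shaped_reward(task_reward=0.2, relevance=0.3))   # guidance contributes
print(shaped_reward(task_reward=0.9, relevance=0.95))  # mostly task reward
```

One appeal of such a gating design, consistent with the abstract's claim, is that the shaping influence is reduced automatically rather than by a hand-tuned annealing schedule.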