Behavior learning in legged robots presents a significant challenge due to their inherent instability and complex constraints. Recent research has proposed using large language models (LLMs) to generate reward functions for reinforcement learning, replacing rewards manually designed by experts. However, this approach, which relies on textual descriptions to define learning objectives, fails to achieve controllable and precise behavior learning with clear directionality. In this paper, we introduce a new video2reward method, which generates reward functions directly from videos depicting the behaviors to be mimicked and learned. Specifically, we first process videos containing the target behaviors, converting the motion of individuals in each video into keypoint trajectories, represented as coordinates, through a video2text transforming module. These trajectories are then fed into an LLM to generate a reward function, which in turn is used to train the policy. To improve the quality of the reward function, we develop a video-assisted iterative reward refinement scheme that visually assesses the learned behaviors and provides textual feedback to the LLM. This feedback guides the LLM to continually refine the reward function, ultimately enabling more efficient behavior learning. Experimental results on bipedal and quadrupedal robot motion control tasks demonstrate that our method surpasses state-of-the-art LLM-based reward generation methods by over 37.6% in terms of human normalized score. More importantly, by switching video inputs, our method can rapidly learn diverse motion behaviors such as walking and running.
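The pipeline above (video → keypoint trajectories → LLM-generated reward → policy training → visual feedback) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: every function name (`extract_keypoints`, `llm_generate_reward`, `train_policy_and_get_feedback`, `video2reward`) is a hypothetical placeholder, the keypoint extractor and LLM calls are stubbed out, and the generated reward is a toy distance-to-target function.

```python
def extract_keypoints(video):
    # Stub for the video2text transforming module: in the paper this
    # converts the motion of individuals in the video into keypoint
    # trajectories represented as coordinates. Here we return a fixed
    # toy (x, y) trajectory.
    return [(0.0, 0.50), (0.05, 0.51), (0.10, 0.52)]

def llm_generate_reward(trajectory, feedback=None):
    # Stub for the LLM call that turns a keypoint trajectory (plus any
    # textual feedback from earlier rounds) into a reward function.
    # As a toy stand-in, reward the policy for reaching the trajectory's
    # final keypoint (negative squared distance).
    target = trajectory[-1]
    def reward(state):
        return -((state[0] - target[0]) ** 2 + (state[1] - target[1]) ** 2)
    return reward

def train_policy_and_get_feedback(reward_fn):
    # Stub: train a policy with reward_fn, render the learned behavior,
    # and visually assess it to produce textual feedback for the LLM.
    return "gait drifts sideways; penalize lateral deviation"

def video2reward(video, n_rounds=3):
    # Video-assisted iterative reward refinement loop: each round the
    # LLM regenerates the reward using feedback from the previous round.
    trajectory = extract_keypoints(video)
    feedback = None
    reward_fn = None
    for _ in range(n_rounds):
        reward_fn = llm_generate_reward(trajectory, feedback)
        feedback = train_policy_and_get_feedback(reward_fn)
    return reward_fn
```

Switching the `video` input changes the extracted trajectories and hence the generated reward, which mirrors how the method learns different behaviors (e.g. walking vs. running) from different demonstration videos.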