Learning reward functions for physical skills is challenging due to the vast spectrum of skills, the high dimensionality of the state and action spaces, and nuanced sensory feedback. The complexity of these tasks makes acquiring expert demonstration data both costly and time-consuming. Large Language Models (LLMs) contain valuable task-related knowledge that can aid in learning these reward functions. However, directly using LLMs to propose reward functions has limitations, such as numerical instability and an inability to incorporate environment feedback. We aim to extract task knowledge from LLMs using environment feedback to create efficient reward functions for physical skills. Our approach consists of two components. We first use the LLM to propose the features and parameterization of the reward function. Next, we update the parameters of this proposed reward function through an iterative self-alignment process. In particular, this process minimizes the ranking inconsistency between the LLM and our learned reward function based on new observations. We validate our method on three simulated physical skill learning tasks, and the results support our design choices.
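To make the second component concrete, the following is a minimal sketch of aligning a parameterized reward with LLM-provided rankings. It is an illustration under stated assumptions, not the method's actual implementation: the abstract does not specify the reward form or the loss, so the linear reward, the Bradley-Terry (logistic) pairwise loss, the learning rate, and the hard-coded `llm_prefs` pairs (standing in for rankings an LLM would give on new observations) are all hypothetical choices.

```python
import numpy as np

def reward(weights, features):
    """Assumed linear reward: features is (n, d), weights is (d,)."""
    return features @ weights

def ranking_inconsistency(weights, features, llm_prefs):
    """Fraction of pairs (i, j), where the LLM prefers i over j,
    that the learned reward does not order the same way."""
    r = reward(weights, features)
    return float(np.mean([r[i] <= r[j] for i, j in llm_prefs]))

def align_step(weights, features, llm_prefs, lr=0.5):
    """One gradient step on an assumed Bradley-Terry pairwise loss."""
    grad = np.zeros_like(weights)
    for i, j in llm_prefs:
        diff = features[i] - features[j]
        # Model probability that trajectory i is preferred over j.
        p = 1.0 / (1.0 + np.exp(-(diff @ weights)))
        # Gradient of -log(p) with respect to the weights.
        grad += (p - 1.0) * diff
    return weights - lr * grad / len(llm_prefs)

def self_align(weights, features, llm_prefs, iters=200):
    """Repeat alignment steps; a full loop would also collect new
    observations and query the LLM for fresh rankings each round."""
    for _ in range(iters):
        weights = align_step(weights, features, llm_prefs)
    return weights

# Hypothetical data: three trajectories described by two features.
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
# Hypothetical LLM rankings: trajectory 0 > 2 > 1.
llm_prefs = [(0, 2), (2, 1), (0, 1)]
init_weights = np.array([-1.0, 1.0])  # deliberately mis-specified start
aligned_weights = self_align(init_weights, features, llm_prefs)
```

Calling `ranking_inconsistency` on `init_weights` and then on `aligned_weights` shows how far the learned reward's ordering is from the rankings before and after alignment. Only the parameters are updated here; the features themselves stay fixed, which mirrors the split between the two components described above.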