Learning reward functions remains a bottleneck in equipping robots with a broad repertoire of skills. Large language models (LLMs) contain valuable task-related knowledge that can aid in the learning of reward functions. However, the reward functions they propose can be imprecise and thus ineffective, requiring further grounding with environment information. We propose a method to learn rewards more efficiently in the absence of humans. Our approach consists of two components: we first use the LLM to propose features and a parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes the ranking inconsistency between the LLM and the learned reward functions based on execution feedback. The method was validated on 9 tasks across 2 simulation environments, demonstrating consistent improvements in training efficacy and efficiency while consuming significantly fewer GPT tokens than an alternative mutation-based method.
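To make the self-alignment step concrete, below is a minimal sketch, not the paper's implementation, of one plausible instantiation: a linear reward over LLM-proposed features whose parameters are updated with a Bradley-Terry-style pairwise loss, so that the learned reward ranks pairs of rollouts the same way the LLM does. All names here (`self_align`, `llm_prefers`, the linear parameterization, the hyperparameters) are illustrative assumptions.

```python
import random
import torch

def reward(theta, features):
    # Linear reward over LLM-proposed per-step features (shape: T x d).
    # The linear parameterization is an assumption of this sketch.
    return features @ theta

def self_align(theta, rollouts, llm_prefers, num_iters=100, lr=1e-2):
    """Update reward parameters by minimizing ranking inconsistency
    between the learned reward and the LLM's pairwise preferences."""
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(num_iters):
        # Sample a pair of rollouts to compare.
        feat_a, feat_b = random.sample(rollouts, 2)
        # `llm_prefers` stands in for querying the LLM with execution
        # feedback from the two rollouts; it returns truthy if the LLM
        # prefers the first rollout.
        pref = torch.tensor(float(llm_prefers(feat_a, feat_b)))
        # Bradley-Terry-style pairwise loss: the learned reward should
        # rank the pair consistently with the LLM's preference.
        logit = reward(theta, feat_a).sum() - reward(theta, feat_b).sum()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logit, pref)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta

# Usage with dummy feature trajectories and a stub in place of the LLM query:
theta = torch.zeros(4, requires_grad=True)
rollouts = [torch.randn(20, 4) for _ in range(8)]
theta = self_align(theta, rollouts, llm_prefers=lambda a, b: a.sum() > b.sum())
```

In practice the stubbed `llm_prefers` call is where execution feedback (e.g., trajectory summaries) would be sent to the LLM for ranking; the loop above only illustrates how such rankings could drive the parameter update.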