The ability to autonomously explore and resolve tasks with minimal human guidance is crucial for the self-development of embodied intelligence. Although reinforcement learning methods can substantially reduce human effort, designing reward functions for real-world tasks remains challenging, especially for high-dimensional robotic control, because of the complex relationships among joints and tasks. Recent advances in large language models (LLMs) enable automatic reward function design. However, existing approaches evaluate reward functions by re-training policies from scratch, which places an undue burden on the reward function: it is expected to remain effective throughout the whole policy improvement process. We argue for a more practical strategy in robotic autonomy, one that focuses on refining existing policies with policy-dependent reward functions rather than a universal one. To this end, we propose a novel reward-policy co-evolution framework in which the reward function and the learned policy benefit from each other's progressive, on-the-fly improvements, resulting in more efficient and higher-performing skill acquisition. Specifically, the reward evolution process translates the robot's previous best reward function and the descriptions of the tasks and environment into text inputs. These inputs are used to query LLMs to generate a dynamic number of reward function candidates, ensuring continuous improvement at each round of evolution. For policy evolution, our method generates new policy populations by hybridizing historically optimal and random policies. Through an improved Bayesian optimization, our approach efficiently and robustly identifies the most capable and plastic reward-policy combination, which then proceeds to the next round of co-evolution. Despite using less data, our approach demonstrates an average normalized improvement of 95.3% across various high-dimensional robotic skill learning tasks.
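The policy evolution step, which builds a new population by hybridizing the historically optimal policy with random policies, could look roughly like the sketch below. The abstract does not specify the hybridization operator, so the linear interpolation used here is an assumption made for illustration, as are the names `hybridize_policies`, `ratios`, `init_scale`, and `seed`, and the representation of a policy as a flat list of parameters.

```python
import random
from typing import List, Sequence


def hybridize_policies(
    best_params: Sequence[float],
    ratios: Sequence[float],
    init_scale: float = 0.1,
    seed: int = 0,
) -> List[List[float]]:
    """Build a policy population by mixing the best policy with random ones.

    ASSUMPTION: a policy is a flat parameter vector and "hybridizing" means
    linear interpolation. The source text does not confirm either choice.
    """
    rng = random.Random(seed)
    population = []
    for alpha in ratios:
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("each mixing ratio must lie in [0, 1]")
        # Fresh randomly initialized policy of the same size as the best one.
        random_params = [rng.gauss(0.0, init_scale) for _ in best_params]
        # alpha = 1 keeps the best policy unchanged; alpha = 0 is fully random.
        child = [
            alpha * b + (1.0 - alpha) * r
            for b, r in zip(best_params, random_params)
        ]
        population.append(child)
    return population


if __name__ == "__main__":
    best = [1.0, -2.0, 0.5]
    for candidate in hybridize_policies(best, ratios=[1.0, 0.5, 0.0]):
        print(candidate)
```

Under this assumption, a ratio near 1 yields candidates that mostly preserve previously learned behavior, while a ratio near 0 yields candidates that are close to freshly initialized and therefore more plastic. Each candidate would then be paired with a reward function candidate and scored by the selection procedure, which is not sketched here.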