The reward function is an essential component of robot learning. The reward directly affects the sample and computational complexity of learning, as well as the quality of the resulting solution. Designing informative rewards requires domain knowledge, which is not always available. We use the properties of the system dynamics to produce a system-appropriate reward without introducing external assumptions. Specifically, we explore an approach that utilizes the Lyapunov exponents of the system dynamics to generate a system-immanent reward. We demonstrate that the `Sum of the Positive Lyapunov Exponents' (SuPLE) is a strong candidate for the design of such a reward. We develop a computational framework for deriving this reward and demonstrate its effectiveness on classical benchmarks for sample-based stabilization of various dynamical systems. It eliminates the need to start training trajectories at arbitrary states, a practice also known as auxiliary exploration. While auxiliary exploration is common in simulated robot learning, it is impractical in real robotic systems, which typically start from natural rest states (e.g., a pendulum hanging at the bottom, a robot resting on the ground) and cannot easily be initialized at arbitrary states. Comparing the performance of SuPLE to commonly used reward functions, we observe that the latter fail to find a solution without auxiliary exploration, even for the task of swinging up a double pendulum and stabilizing it in the upright position, a prototypical scenario for multi-linked robots. SuPLE-induced rewards thus offer a novel route to effective robot learning in typical, as opposed to highly specialized or fine-tuned, scenarios. Our code is publicly available for reproducibility and further research.
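To make the idea concrete, the following is a minimal illustrative sketch, not the paper's exact formulation: for a single undamped pendulum, a local SuPLE-style reward can be approximated as the sum of the positive eigenvalues of the symmetric part of the dynamics Jacobian at the current state (a local proxy for the positive Lyapunov exponents). The dynamics model, parameters, and function names here are assumptions for illustration only.

```python
import numpy as np

# Hypothetical parameters: gravity and pendulum length.
G, L = 9.81, 1.0

def pendulum_jacobian(theta):
    """Jacobian of the undamped pendulum dynamics
    d/dt [theta, omega] = [omega, -(G/L) * sin(theta)],
    evaluated at angle `theta` (theta = 0 is the bottom rest state)."""
    return np.array([[0.0, 1.0],
                     [-(G / L) * np.cos(theta), 0.0]])

def suple_reward(theta):
    """Local SuPLE-style reward: sum of the positive eigenvalues of the
    symmetric part of the Jacobian, a local proxy for the sum of the
    positive Lyapunov exponents."""
    J = pendulum_jacobian(theta)
    eig = np.linalg.eigvalsh(0.5 * (J + J.T))  # real, ascending
    return float(eig[eig > 0].sum())

# The reward is larger near the unstable upright fixed point (theta = pi)
# than at the stable bottom rest state (theta = 0), so maximizing it
# drives trajectories toward the upright configuration.
print(suple_reward(np.pi) > suple_reward(0.0))  # → True
```

The sketch illustrates why such a reward can replace auxiliary exploration: the unstable equilibrium, where local trajectory divergence is strongest, is exactly where the local positive-exponent sum is maximal, so the reward signal points toward it from the natural rest state.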