Ballbot (i.e., ball-balancing robot) navigation usually relies on methods rooted in control theory (CT), and works that apply reinforcement learning (RL) to the problem remain rare and are generally limited to specific subtasks (e.g., balance recovery). Unlike CT-based methods, RL does not require (simplifying) assumptions about environment dynamics (e.g., the absence of slippage between the ball and the floor). In addition to this increased modeling accuracy, RL agents can easily be conditioned on additional observations, such as depth maps, without the need for explicit formulations from first principles, leading to increased adaptivity. Despite these advantages, there has been little to no investigation into the capabilities, data efficiency, and limitations of RL-based methods for ballbot control and navigation. Furthermore, there is a notable absence of an open-source, RL-friendly simulator for this task. In this paper, we present an open-source ballbot simulation based on MuJoCo, and show that with appropriate conditioning on exteroceptive observations as well as reward shaping, policies learned by classical model-free RL methods are capable of effectively navigating through randomly generated uneven terrain using a reasonable amount of data (four to five hours on a system operating at 500 Hz). Our code is made publicly available.