Autonomous navigation in crowded environments is an open problem with many applications, and it is essential for the coexistence of robots and humans in the smart cities of the future. In recent years, deep reinforcement learning approaches have been shown to outperform model-based algorithms. Nevertheless, although the reported results are promising, existing works do not take full advantage of the capabilities their models offer. They usually get trapped in local optima during training, which prevents them from learning the optimal policy. They fail to visit and interact appropriately with every possible state, such as the states near the goal or near dynamic obstacles. In this work, we propose using intrinsic rewards to balance exploration and exploitation, so that exploration depends on the uncertainty of the states rather than on how long the agent has been trained, encouraging the agent to be more curious about unknown states. We explain the benefits of the approach and compare it with other exploration algorithms that may be used for crowd navigation. We perform many simulation experiments in which we modify several state-of-the-art algorithms, showing that the use of intrinsic rewards makes the robot learn faster and reach higher rewards and success rates (fewer collisions) in shorter navigation times, outperforming the state of the art.
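To make the idea of state-dependent exploration concrete, the following is a minimal sketch of an intrinsic reward that shrinks as a region of the state space is visited more often. It uses a simple count-based novelty bonus as a stand-in; this is not necessarily the intrinsic reward formulation used in this work. The class name `CountBasedBonus`, the parameters `beta` and `cell_size`, and the helper `total_reward` are all hypothetical names introduced for illustration.

```python
import math
from collections import defaultdict


class CountBasedBonus:
    """Illustrative count-based novelty bonus (an assumption, not the paper's method).

    States are discretized into grid cells; the bonus for a cell decays as
    beta / sqrt(visit_count), so rarely visited states yield more reward.
    """

    def __init__(self, beta=0.1, cell_size=0.5):
        self.beta = beta            # scale of the intrinsic reward (hypothetical)
        self.cell_size = cell_size  # discretization resolution (hypothetical)
        self.counts = defaultdict(int)

    def _key(self, state):
        # Map a continuous state vector to a discrete grid cell.
        return tuple(math.floor(x / self.cell_size) for x in state)

    def __call__(self, state):
        key = self._key(state)
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])


def total_reward(extrinsic, state, bonus):
    # Reward used for training: task reward plus the novelty bonus.
    return extrinsic + bonus(state)
```

Unlike an exploration rate that decays with training time, the bonus here depends only on how often the agent has seen a given region, so a state near the goal that is first reached late in training still receives the full bonus.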