Traditional Reinforcement Learning (RL) algorithms are usually applied in robotics to learn controllers that act with a fixed control rate. Because these algorithms are formulated in discrete time, they are oblivious to the effects of the chosen control rate: finding the correct rate can be difficult, and mistakes often result in excessive use of computing resources or even a lack of convergence. We propose Soft Elastic Actor-Critic (SEAC), a novel off-policy actor-critic algorithm that addresses this issue. SEAC implements elastic time steps, that is, time steps with a known, variable duration, which allow the agent to change its control frequency to adapt to the situation. In practice, SEAC applies control only when necessary, minimizing computational resources and data usage. We evaluate SEAC in simulation on a Newtonian kinematics maze navigation task and on a 3D racing video game, Trackmania. SEAC outperforms the Soft Actor-Critic (SAC) baseline in terms of energy efficiency and overall time management and, most importantly, does not require identifying a control frequency for the learned controller. SEAC also trained faster and more stably than SAC, especially at control rates where SAC struggled to converge. We also compared SEAC with a similar approach, the Continuous-Time Continuous-Options (CTCO) model, and SEAC achieved better task performance. These findings highlight the potential of SEAC for practical, real-world RL applications in robotics.
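To make the idea of an elastic time step concrete, the following is a minimal sketch, not the SEAC implementation. It assumes a hypothetical 1-D point mass under Newtonian kinematics (`PointMass`) and an action made of a force plus the duration for which that force is held; the names `PointMass` and `rollout` are illustrative assumptions and do not come from the paper. The sketch only shows that a single long step can reach the same state as many short fixed-rate steps while requiring far fewer control decisions.

```python
from dataclasses import dataclass


@dataclass
class PointMass:
    """Hypothetical 1-D point mass under Newtonian kinematics (toy environment)."""
    position: float = 0.0
    velocity: float = 0.0
    mass: float = 1.0

    def step(self, force: float, duration: float) -> None:
        """Hold `force` constant for `duration` seconds (closed-form update)."""
        accel = force / self.mass
        self.position += self.velocity * duration + 0.5 * accel * duration ** 2
        self.velocity += accel * duration


def rollout(actions):
    """Apply a sequence of (force, duration) actions.

    Returns the final environment, the elapsed time, and the number of
    control decisions that were needed.
    """
    env = PointMass()
    elapsed = 0.0
    for force, duration in actions:
        env.step(force, duration)
        elapsed += duration
    return env, elapsed, len(actions)


# Fixed control rate: ten decisions, each lasting 0.1 s.
fixed_rate = [(2.0, 0.1)] * 10
# Elastic time step: one decision that holds the same force for 1.0 s.
elastic = [(2.0, 1.0)]

env_fixed, t_fixed, n_fixed = rollout(fixed_rate)
env_elastic, t_elastic, n_elastic = rollout(elastic)

print(n_fixed, n_elastic)
print(env_fixed.position, env_elastic.position)
```

In this toy case both rollouts cover the same elapsed time and end in the same state up to floating-point error, but the elastic rollout needs one decision instead of ten. In a learned setting, the duration would be chosen by the policy rather than written by hand.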