Deploying controllers trained with Reinforcement Learning (RL) on real robots can be challenging: RL relies on agents' policies being modeled as Markov Decision Processes (MDPs), which assume an inherently discrete passage of time. As a result, nearly all RL-based control systems employ a fixed-rate control strategy, with a period (or time step) typically chosen based on the developer's experience or on specific characteristics of the application environment. Unfortunately, the system must then be controlled at the highest, worst-case frequency to ensure stability, which can demand significant computational and energy resources and hinders deployment of the controller on onboard hardware. Adhering to the principles of reactive programming, we surmise that applying control actions only when necessary enables the use of simpler hardware and helps reduce energy consumption. We challenge the fixed-frequency assumption by proposing a variant of RL with a variable control rate. In this approach, the policy decides both the action the agent should take and the duration of the time step associated with that action. In this new setting, we extend Soft Actor-Critic (SAC) to compute the optimal policy under a variable control rate, introducing the Soft Elastic Actor-Critic (SEAC) algorithm. We show the efficacy of SEAC through a proof-of-concept simulation driving an agent with Newtonian kinematics. Our experiments show higher average returns, shorter task-completion times, and reduced computational resource usage compared to fixed-rate policies.
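The core idea of a variable control rate can be illustrated with a minimal sketch: instead of stepping the environment at a fixed period, the policy emits an action together with the duration over which that action is held. The snippet below is an illustrative toy, not the SEAC algorithm itself; the hand-written PD-style `policy` and the duration heuristic are assumptions standing in for a learned actor, and only the Newtonian-kinematics integration mirrors the setup described above.

```python
import numpy as np

def policy(state):
    # Toy stand-in for a learned variable-rate policy (hypothetical, not SEAC):
    # accelerate toward the origin PD-style, and choose a longer time step when
    # the agent is far from the goal, a shorter one when precision matters.
    position, velocity = state
    action = -0.5 * position - 0.8 * velocity        # acceleration command
    dt = float(np.clip(abs(position), 0.05, 0.5))    # coarse far away, fine near goal
    return action, dt

def step(state, action, dt):
    # Newtonian kinematics integrated over the variable step length dt.
    position, velocity = state
    position += velocity * dt + 0.5 * action * dt**2
    velocity += action * dt
    return np.array([position, velocity])

state = np.array([2.0, 0.0])   # start 2 m from the goal, at rest
elapsed, decisions = 0.0, 0
while abs(state[0]) > 0.01 and decisions < 1000:
    action, dt = policy(state)
    state = step(state, action, dt)
    elapsed += dt
    decisions += 1

print(f"{decisions} control decisions over {elapsed:.2f}s of simulated time")
```

Because far-from-goal steps are long, the agent spends fewer control decisions (and hence less computation) than a fixed-rate controller running at the finest period would need for the same simulated time span, which is the intuition behind the claimed resource savings.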