Model-free Reinforcement Learning (RL) generally suffers from poor sample complexity, mostly because agents must exhaustively explore the state-action space to find well-performing policies. On the other hand, we postulate that expert knowledge of the system often allows us to design simple rules that good policies should follow at all times. In this work, we therefore propose a simple yet effective modification of continuous actor-critic frameworks that incorporates such rules and avoids regions of the state-action space known to be suboptimal, thereby significantly accelerating the convergence of RL agents. Concretely, we saturate the actions chosen by the agent whenever they violate these rules and, critically, modify the gradient update step of the policy so that the learning process is not affected by the saturation step. In a room temperature control case study, this approach allows agents to converge to well-performing policies up to 6 to 7 times faster than classical agents, without added computational overhead and while retaining good final performance.
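The saturation step and the gradient issue it creates can be illustrated with a minimal sketch. It is not the implementation used in this work. It is one possible reading, under the following assumptions: a scalar action in [-1, 1] (negative for cooling, positive for heating), a hypothetical comfort band of 21 to 23 °C, and a pass-through treatment of the saturation in the chain rule. The names `rule_bounds`, `saturate`, and `actor_gradient`, along with all numeric values, are illustrative only.

```python
import numpy as np


def rule_bounds(temp, t_min=21.0, t_max=23.0):
    """Hypothetical expert rule (an assumption, not taken from the source):
    never cool a room that is too cold, never heat a room that is too warm."""
    if temp < t_min:
        return 0.0, 1.0   # too cold: only non-negative (heating) actions allowed
    if temp > t_max:
        return -1.0, 0.0  # too warm: only non-positive (cooling) actions allowed
    return -1.0, 1.0      # inside the comfort band: no restriction


def saturate(raw_action, temp):
    """Clip the raw policy output into the rule-compliant interval."""
    low, high = rule_bounds(temp)
    return float(np.clip(raw_action, low, high))


def actor_gradient(raw_action, temp, dq_da, da_dtheta, pass_through=True):
    """Chain-rule gradient of Q with respect to one policy parameter.

    With pass_through=False the derivative of the clip is used, which is
    zero whenever the action was saturated, so the actor gets no signal.
    With pass_through=True the saturation is treated as the identity in
    the backward pass (an assumed way of keeping the update informative).
    """
    low, high = rule_bounds(temp)
    clipped = raw_action < low or raw_action > high
    dsat_da = 1.0 if (pass_through or not clipped) else 0.0
    return dq_da * dsat_da * da_dtheta


# The room is at 19 degrees C and the policy proposes cooling (-0.4).
raw = -0.4
applied = saturate(raw, temp=19.0)
naive = actor_gradient(raw, 19.0, dq_da=0.5, da_dtheta=2.0, pass_through=False)
modified = actor_gradient(raw, 19.0, dq_da=0.5, da_dtheta=2.0, pass_through=True)
print(applied, naive, modified)
```

In this sketch the environment receives the saturated action, so the rule is never violated. The naive gradient vanishes for the clipped action, whereas the pass-through variant still moves the policy parameters. This is one way to read the requirement that learning should not be affected by the saturation step.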