This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy, and compares their performance against algorithms that train a deterministic policy. Stochastic policies optimize a probability distribution over actions to maximize reward, while deterministic policies select a single action per state. The study examines training a stochastic policy with both static and dynamic entropy and then executing deterministic actions to control the quadcopter, and compares this against training a deterministic policy and executing its deterministic actions. For this research, the Soft Actor-Critic (SAC) algorithm was chosen as the stochastic algorithm, and the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen as the deterministic algorithm. The training and simulation results show that dynamic entropy tuning benefits quadcopter control by preventing catastrophic forgetting and improving exploration efficiency.
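For context, dynamic entropy tuning in SAC is commonly realized through automatic temperature adjustment, in which the temperature coefficient alpha is learned alongside the policy rather than held fixed. A sketch of the standard objective, where pi denotes the current stochastic policy and \bar{\mathcal{H}} the chosen target entropy (typically set to the negative of the action dimension), is

J(\alpha) = \mathbb{E}_{a_t \sim \pi}\left[ -\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right],

so that alpha grows when the policy's entropy falls below the target (encouraging exploration) and shrinks when it exceeds the target (favoring exploitation).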