Policy-gradient algorithms are effective reinforcement learning methods for solving control problems with continuous state and action spaces. To compute near-optimal policies, it is essential in practice to include exploration terms in the learning objective. Although the effectiveness of these terms is usually justified by an intrinsic need to explore environments, we propose a novel analysis that distinguishes two distinct effects of these techniques. First, they make it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Second, they modify the gradient estimates, increasing the probability that the stochastic parameter update eventually yields an optimal policy. In light of these effects, we discuss and empirically illustrate exploration strategies based on entropy bonuses, highlighting their limitations and opening avenues for future work on the design and analysis of such strategies.
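To make the smoothing effect concrete, the following is a minimal sketch and not the experiment reported in this work. It assumes a one-dimensional action, a Gaussian policy N(mu, sigma^2), and a hypothetical two-bump reward; the names `reward`, `expected_return`, `regularized_objective`, the coefficient `lam`, and all numeric values are illustrative assumptions. It counts the local maxima of the expected return as a function of the policy mean for a narrow and a wide policy, the latter being what an entropy bonus favors.

```python
import math

def reward(a):
    # Hypothetical reward: a lower bump near a = -2 and a higher bump near a = +2.
    return (0.6 * math.exp(-(a + 2.0) ** 2 / 0.1)
            + 1.0 * math.exp(-(a - 2.0) ** 2 / 0.1))

STEP = 0.02
ACTIONS = [-8.0 + STEP * i for i in range(801)]  # action grid on [-8, 8]
REWARDS = [reward(a) for a in ACTIONS]

def expected_return(mu, sigma):
    # J(mu, sigma) = E_{a ~ N(mu, sigma^2)}[reward(a)], via a Riemann sum.
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for a, r in zip(ACTIONS, REWARDS):
        total += r * math.exp(-0.5 * ((a - mu) / sigma) ** 2) / norm
    return total * STEP

def gaussian_entropy(sigma):
    # Differential entropy of a 1-D Gaussian with standard deviation sigma.
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def regularized_objective(mu, sigma, lam=0.1):
    # Expected return plus an entropy bonus weighted by lam (assumed value).
    return expected_return(mu, sigma) + lam * gaussian_entropy(sigma)

def count_local_maxima(sigma):
    # Number of strict interior local maxima of J(., sigma) over a grid of means.
    mus = [-4.0 + 0.1 * i for i in range(81)]
    js = [expected_return(mu, sigma) for mu in mus]
    return sum(1 for i in range(1, len(js) - 1) if js[i - 1] < js[i] > js[i + 1])

if __name__ == "__main__":
    for sigma in (0.3, 3.0):
        print(sigma, count_local_maxima(sigma), gaussian_entropy(sigma))
```

With these assumed values, the narrow policy (`sigma = 0.3`) should see two peaks in the mean, while the wide policy (`sigma = 3.0`) should see a single peak on the side of the higher bump. In this toy setting the location of that single peak is shifted relative to the best action, so the sketch illustrates only the removal of the local optimum, not the conditions under which the global maximum is preserved.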