Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with entropy regularization added for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross-entropy method (CEM) that conditions on inputs (states). The idea is to start with a broad policy and slowly concentrate it around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, which uses CCEM for the actor update, performs better than Soft Actor-Critic and is much less sensitive to entropy regularization.
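To make the per-state CEM update that CCEM is said to track more concrete, the following is a minimal sketch of plain CEM for a single fixed state, not the paper's Greedy AC implementation. It assumes a one-dimensional Gaussian policy, a fixed and known action-value function, and a smoothing step size; the names `cem_step`, `run_cem`, `toy_q`, `top_frac`, and `step` are hypothetical. It uses no state-conditioned network, no separate proposal policy, and no learned critic.

```python
import numpy as np

def cem_step(mean, std, q_fn, rng, n_samples=64, top_frac=0.2, step=0.5):
    """One CEM iteration for a 1-D Gaussian over actions in a single state."""
    # Sample candidate actions from the current (broad) policy.
    actions = rng.normal(mean, std, size=n_samples)
    values = q_fn(actions)
    # Keep the actions whose values fall in the top fraction.
    n_elite = max(1, int(np.ceil(top_frac * n_samples)))
    elite = actions[np.argsort(values)[-n_elite:]]
    # Maximum-likelihood Gaussian fit to the elite actions, blended with the
    # old parameters (the blending step size is an assumption of this sketch).
    new_mean = (1 - step) * mean + step * elite.mean()
    new_std = (1 - step) * std + step * elite.std()
    # Floor the standard deviation to keep sampling well defined.
    return new_mean, max(new_std, 1e-3)

def run_cem(q_fn, mean=0.0, std=2.0, iters=50, seed=0):
    """Repeat CEM steps so the policy concentrates around high-value actions."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        mean, std = cem_step(mean, std, q_fn, rng)
    return mean, std

def toy_q(a):
    """Hypothetical action-value function for one state, maximized at a = 1."""
    return -(a - 1.0) ** 2

final_mean, final_std = run_cem(toy_q)
print(final_mean, final_std)
```

In this toy setting the policy starts broad and narrows around the maximizing action. The conditional version described in the abstract would instead condition the policy on the state and draw the candidate actions from a more slowly concentrating proposal policy.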