Trust region methods, being scalable and efficient, have spurred much of the progress in deep reinforcement learning. However, the pessimism of such algorithms, which constrain every policy update to a trust region, has been shown to suppress exploration and harm performance. Exploratory algorithms such as SAC use entropy to encourage exploration, but in doing so implicitly optimize a different objective. We first observe this inconsistency and then put forward an analogous augmentation technique that combines well with on-policy algorithms whenever a value critic is involved. Surprisingly, the proposed method consistently satisfies the soft policy improvement theorem while being more extensible. As the analysis suggests, controlling the temperature coefficient is crucial for balancing exploration and exploitation. Empirical tests on MuJoCo benchmark tasks show that the agent is encouraged toward higher-reward regions and achieves better performance. Furthermore, we verify the exploration bonus of our method on a set of custom environments.