Model-free deep reinforcement learning has achieved great success in many domains, such as video games, recommendation systems, and robotic control tasks. In continuous control tasks, the widely used Gaussian policies in many cases result in ineffective exploration of the environment and limit the performance of algorithms. In this paper, we propose a density-free off-policy algorithm, Generative Actor-Critic (GAC), which uses a push-forward model to increase the expressiveness of policies and includes an entropy-like technique, the MMD-entropy regularizer, to balance exploration and exploitation. Additionally, we devise an adaptive mechanism to automatically scale this regularizer, which further improves the stability and robustness of GAC. The experimental results show that push-forward policies possess desirable features, such as multi-modality, which can markedly improve the efficiency of exploration and the asymptotic performance of algorithms.
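To make the two ingredients named above more concrete, the following is a minimal sketch, not the GAC implementation. It assumes a push-forward policy can be illustrated as a small network that maps a state and a Gaussian noise vector to a bounded action, so the policy is defined only through its samples, with no explicit density. It also assumes that an MMD-based, entropy-like term can be illustrated by the standard biased estimate of the squared maximum mean discrepancy between policy samples and a reference sample. The Gaussian kernel, the uniform reference distribution, the network shape, and the names `init_params`, `push_forward_actions`, and `mmd2_biased` are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix between rows of x and rows of y."""
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


def init_params(state_dim, noise_dim, hidden_dim, action_dim, rng):
    """Hypothetical helper: random weights for a one-hidden-layer generator."""
    in_dim = state_dim + noise_dim
    return {
        "W1": rng.standard_normal((in_dim, hidden_dim)) / np.sqrt(in_dim),
        "b1": np.zeros(hidden_dim),
        "W2": rng.standard_normal((hidden_dim, action_dim)) / np.sqrt(hidden_dim),
        "b2": np.zeros(action_dim),
    }


def push_forward_actions(state, params, n_samples, noise_dim, rng):
    """Sample actions by pushing Gaussian noise through the generator.

    No action density is evaluated; the policy is defined by its samples.
    """
    z = rng.standard_normal((n_samples, noise_dim))
    s = np.tile(state, (n_samples, 1))
    h = np.maximum(np.concatenate([s, z], axis=1) @ params["W1"] + params["b1"], 0.0)
    return np.tanh(h @ params["W2"] + params["b2"])  # actions in [-1, 1]
```

In this sketch, one would draw a batch of actions for a state with `push_forward_actions`, draw a reference batch with `rng.uniform(-1.0, 1.0, size=actions.shape)`, and pass both to `mmd2_biased`. A small value indicates that the sampled actions are spread across the action range, which is the sense in which such a term can play an entropy-like role without requiring a density. How GAC actually defines and scales its regularizer is not specified in the abstract.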