Adversarial Policy Optimization in Deep Reinforcement Learning

A policy represented by a deep neural network can overfit to spurious features in observations, which hampers a reinforcement learning (RL) agent in learning an effective policy. This issue becomes severe in high-dimensional state spaces, where the agent struggles to learn a useful policy. Data augmentation can boost the performance of RL agents by mitigating the effect of overfitting. However, data augmentation is a form of prior knowledge, and naively applying it in an environment might worsen an agent's performance. In this paper, we propose a novel RL algorithm to mitigate this issue and improve the efficiency of the learned policy. Our approach consists of a max-min game-theoretic objective in which a perturber network modifies the state to maximize the agent's probability of taking a different action while minimizing the distortion in the state. In contrast, the policy network updates its parameters to minimize the effect of the perturbation while maximizing the expected future reward. Based on this objective, we propose a practical deep reinforcement learning algorithm, Adversarial Policy Optimization (APO). Our method is agnostic to the type of policy optimization, and thus data augmentation can be incorporated to harness its benefit. We evaluate our approach on several DeepMind Control robotic environments with high-dimensional and noisy state settings. Empirical results demonstrate that APO consistently outperforms the state-of-the-art on-policy PPO agent. We further compare our method with the state-of-the-art data augmentation method RAD and the regularization-based approach DRAC. APO shows better performance than these baselines.
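To make the max-min objective concrete, the following is a minimal sketch on a toy discrete-action problem, not the implementation evaluated in the paper. The abstract does not say how the change in action probability or the state distortion is measured, so the sketch assumes a KL divergence between the action distributions on the original and perturbed states, and a squared Euclidean distance for the distortion. All names (`linear_policy`, `perturber_objective`, `policy_loss`, `distortion_coef`, `consistency_coef`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions (assumed divergence measure)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def squared_distortion(state, perturbed):
    """Squared Euclidean distance between states (assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(state, perturbed))


def linear_policy(weights, state):
    """Hypothetical toy policy: one weight row per action, softmax over logits."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(logits)


def perturber_objective(weights, state, perturbed, distortion_coef=1.0):
    """Quantity the perturber maximizes: action shift minus state distortion."""
    shift = kl_divergence(linear_policy(weights, state),
                          linear_policy(weights, perturbed))
    return shift - distortion_coef * squared_distortion(state, perturbed)


def policy_loss(rl_loss, weights, state, perturbed, consistency_coef=1.0):
    """Quantity the policy minimizes: its usual RL loss plus the action shift."""
    shift = kl_divergence(linear_policy(weights, state),
                          linear_policy(weights, perturbed))
    return rl_loss + consistency_coef * shift


# Toy usage: two actions, two state features, one hand-picked perturbation.
weights = [[1.0, 0.0], [0.0, 1.0]]
state = [0.5, 0.2]
perturbed = [0.4, 0.6]
print(perturber_objective(weights, state, perturbed))
print(policy_loss(1.5, weights, state, perturbed))
```

In this sketch `rl_loss` stands in for whatever policy optimization loss is used (for example a PPO surrogate), which reflects the claim that the method is agnostic to the type of policy optimization. In a full algorithm the perturbed state would come from a trained perturber network rather than being chosen by hand.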
