Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it is often valuable to learn a diverse set of solutions, for instance to make an agent's interaction with users more engaging or to improve the robustness of a policy to unexpected perturbations. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternately constrains the diversity of the strategies and the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and using policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards while discovering more diverse strategies, often with better sample efficiency.
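To make the idea of an information-theoretic intrinsic reward and an alternating constraint more concrete, the following is a minimal sketch, not the DGPO implementation. It assumes a commonly used mutual-information-style reward, log q(z | s) - log p(z), where z indexes a strategy, q is a learned discriminator, and p(z) is uniform. The function names (`diversity_reward`, `select_reward`), the threshold parameter, and the switching rule are all hypothetical and chosen only for illustration; the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def diversity_reward(discriminator_logits, z, num_strategies):
    """Hypothetical intrinsic reward: log q(z | s) - log p(z).

    discriminator_logits: unnormalized scores over strategy indices for
        the current state (assumed to come from a learned discriminator).
    z: index of the strategy the policy is currently conditioned on.
    num_strategies: number of strategies; p(z) is assumed uniform.
    """
    q = softmax(discriminator_logits)
    return math.log(q[z]) - math.log(1.0 / num_strategies)


def select_reward(extrinsic, intrinsic, avg_return, return_threshold):
    """Hypothetical alternating rule (an assumption, not the paper's).

    While the running average return is below the threshold, optimize
    the extrinsic reward; once the constraint is met, optimize diversity.
    """
    if avg_return < return_threshold:
        return extrinsic
    return intrinsic


if __name__ == "__main__":
    # The discriminator is confident the state belongs to strategy 0,
    # so strategy 0 is easy to tell apart and gets a positive reward.
    r_int = diversity_reward([5.0, 0.0, 0.0, 0.0], z=0, num_strategies=4)
    print(select_reward(1.0, r_int, avg_return=0.9, return_threshold=0.5))
```

Under these assumptions, a state that the discriminator cannot attribute to any particular strategy yields an intrinsic reward of zero, while states that clearly identify the active strategy are rewarded, which pushes the strategies toward visiting distinguishable states.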