While reinforcement learning (RL) has shown promising performance, its sample complexity remains a substantial hurdle that restricts its broader application across a variety of domains. Imitation learning (IL) uses oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. We introduce Robust Policy Improvement (RPI), an algorithm that actively interleaves IL and RL based on an online estimate of their performance. RPI draws on the strengths of IL, using oracle queries to facilitate exploration, which is notably challenging in sparse-reward RL, particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. The algorithm can learn from and improve upon a diverse set of black-box oracles. Integral to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG), both of which reason, state by state, over whether to imitate the oracles or to learn from the learner's own value function when the learner's performance surpasses that of the oracles in a specific state. Empirical evaluations and theoretical analysis validate that RPI outperforms existing state-of-the-art methods across various benchmark domains.
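The state-wise choice between imitating an oracle and relying on the learner can be illustrated with a minimal sketch. It assumes that a scalar value estimate is available for the learner and for each oracle at a given state, and it simply picks whichever estimate is highest. The names `select_policy`, `LEARNER`, and the toy value functions are hypothetical and are not taken from the paper; the actual RAPS and RPG components may estimate and compare values differently.

```python
from typing import Callable, Sequence

State = Sequence[float]
ValueFn = Callable[[State], float]

# Hypothetical sentinel: "follow the learner's own policy in this state".
LEARNER = -1


def select_policy(
    state: State,
    learner_value: ValueFn,
    oracle_values: Sequence[ValueFn],
) -> int:
    """Return the index of the oracle to imitate in `state`, or LEARNER.

    Assumption: each value function returns a point estimate of the
    expected return from `state`. Ties are resolved in favor of the
    learner, so an oracle is chosen only if it is strictly better.
    """
    best_idx = LEARNER
    best_val = learner_value(state)
    for idx, value_fn in enumerate(oracle_values):
        val = value_fn(state)
        if val > best_val:
            best_idx, best_val = idx, val
    return best_idx


# Toy value estimates (illustrative only, not from the paper).
def learner_value(s: State) -> float:
    return s[0]          # learner is stronger where s[0] is large


def oracle_a(s: State) -> float:
    return 0.5           # mediocre everywhere


def oracle_b(s: State) -> float:
    return 1.0 - s[0]    # strong where s[0] is small


oracles = [oracle_a, oracle_b]
early_choice = select_policy([0.1], learner_value, oracles)  # imitate oracle_b
late_choice = select_policy([0.9], learner_value, oracles)   # use the learner
```

In this toy setting the learner defers to `oracle_b` in states where its own estimate is low and takes over where its estimate exceeds every oracle's, which mirrors the gradual shift from IL to RL described above.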