A key challenge for a reinforcement learning (RL) agent is to incorporate external/expert advice into its learning. An algorithm that shapes the learning of an RL agent with external advice should (a) maintain policy invariance; (b) accelerate the agent's learning; and (c) learn from arbitrary advice [3]. To address this challenge, this paper formulates the problem of incorporating external advice in RL as a multi-armed bandit called shaping bandits. The reward of each arm of the shaping bandit corresponds to the return obtained either by following the expert or by following a default RL algorithm that learns from the true environment reward. We show that directly applying existing bandit and shaping algorithms that do not reason about the non-stationary nature of the underlying returns can lead to poor results. We therefore propose UCB-PIES (UPIES), Racing-PIES (RPIES), and Lazy PIES (LPIES), three shaping algorithms built on different assumptions that reason about the long-term consequences of following the expert policy or the default RL algorithm. Our experiments in four different settings show that the proposed algorithms achieve the above goals, whereas the other algorithms fail to do so.
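The two-armed formulation described in the abstract can be illustrated with a minimal sketch. The code below is not UPIES, RPIES, or LPIES; it is plain UCB1 over two arms (arm 0 follows the expert, arm 1 follows the default RL learner), which is the kind of off-the-shelf bandit the abstract says can perform poorly because it ignores non-stationary returns. The function names, the exploration constant `c`, and the toy return model in `toy_return` are all assumptions made for illustration and do not come from the paper.

```python
import math
import random


def ucb1_select(counts, means, t, c=2.0):
    """Return the index of the arm with the highest UCB1 score."""
    # Pull every arm once before trusting the confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_shaping_bandit(episode_return, n_episodes, seed=0):
    """Pick arm 0 (expert) or arm 1 (default RL) at the start of each episode.

    `episode_return(arm, pulls, rng)` is a hypothetical callback that returns
    the episode return for the chosen arm, given how often it was pulled so far.
    """
    rng = random.Random(seed)
    counts = [0, 0]
    means = [0.0, 0.0]
    for t in range(1, n_episodes + 1):
        arm = ucb1_select(counts, means, t)
        g = episode_return(arm, counts[arm], rng)
        counts[arm] += 1
        # Incremental sample mean; it weights old and new returns equally,
        # which is exactly what goes wrong when returns drift over time.
        means[arm] += (g - means[arm]) / counts[arm]
    return counts, means


def toy_return(arm, pulls, rng):
    """Assumed toy model: a fixed mediocre expert and an improving learner."""
    noise = rng.uniform(-0.05, 0.05)
    if arm == 0:
        return 0.5 + noise
    # The default learner's return grows with the number of times it was trained.
    return min(1.0, 0.1 + 0.01 * pulls) + noise


if __name__ == "__main__":
    counts, means = run_shaping_bandit(toy_return, n_episodes=500)
    print("pulls (expert, default):", counts)
    print("mean return (expert, default):", [round(m, 3) for m in means])
```

In this toy model the learner's return depends on how often its arm has been pulled, so the running mean for arm 1 lags behind its current performance. That lag is one simple way to see why an algorithm for this setting has to reason about the long-term consequences of each choice and not only about past average returns.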