Learning from rewards (reinforcement learning, RL) and learning to imitate a teacher (teacher-student learning) are two established approaches for solving sequential decision-making problems. To combine the benefits of both, it is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives. However, lacking a principled method for balancing these objectives, prior work has relied on heuristics and problem-specific hyperparameter searches. We present a $\textit{principled}$ approach, along with an approximate implementation, for $\textit{dynamically}$ and $\textit{automatically}$ balancing when to follow the teacher and when to use rewards. The main idea is to adjust the importance of teacher supervision by comparing the agent's performance to the counterfactual scenario in which the agent learns only from rewards, without teacher supervision. If teacher supervision improves performance, its importance is increased; otherwise, it is decreased. Our method, $\textit{Teacher Guided Reinforcement Learning}$ (TGRL), outperforms strong baselines across diverse domains without hyperparameter tuning.
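The balancing idea can be illustrated with a minimal sketch. It is not the paper's actual update rule: the function names, the additive update, the `step_size` parameter, and the clipping at zero are all assumptions made for illustration. The sketch only shows the direction of the adjustment: the teacher weight grows when the teacher-guided policy outperforms the reward-only counterfactual and shrinks otherwise.

```python
def update_teacher_weight(weight, return_with_teacher, return_reward_only, step_size=0.1):
    """Hypothetical update: nudge the teacher weight by the performance gap.

    return_with_teacher: estimated return of the policy trained with teacher supervision.
    return_reward_only:  estimated return of the counterfactual policy trained from rewards alone.
    """
    gap = return_with_teacher - return_reward_only
    # A positive gap means the teacher helps, so the weight increases; a negative gap decreases it.
    # Clipping at zero (an assumption) keeps the teacher term from becoming a penalty.
    return max(0.0, weight + step_size * gap)


def combined_objective(rl_objective, teacher_objective, weight):
    """Weighted sum of the reward-based and teacher-student objectives (illustrative form)."""
    return rl_objective + weight * teacher_objective


weight = 1.0
weight_up = update_teacher_weight(weight, return_with_teacher=10.0, return_reward_only=8.0)
weight_down = update_teacher_weight(weight, return_with_teacher=5.0, return_reward_only=8.0)
```

Here `weight_up` ends above the starting weight because the teacher-guided policy did better, and `weight_down` ends below it because the reward-only policy did better.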