The Teacher-Student Framework (TSF) is a reinforcement learning setting in which a teacher agent guards the training of a student agent by intervening and providing online demonstrations. When the teacher policy is assumed to be optimal, it has the perfect timing and capability to intervene in the student agent's learning process, providing safety guarantees and exploration guidance. In many real-world settings, however, a well-performing teacher policy is expensive or even impossible to obtain. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which triggers teacher intervention based on trajectory-based value estimation. Theoretical analysis shows that the proposed TS2C algorithm attains efficient exploration and a substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy, achieving higher accumulated reward in held-out test environments. Code is available at https://metadriverse.github.io/TS2C.
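To make the idea of value-based intervention concrete, the following is a minimal sketch, not the TS2C implementation. It assumes a simple gating rule: the teacher takes control only when a value estimate rates the teacher's action above the student's by more than a threshold. The names `should_intervene`, `shared_control_step`, `value_fn`, and `threshold` are hypothetical, and the generic `value_fn` stands in for the trajectory-based value estimation mentioned above, whose exact form is not given here.

```python
from typing import Callable, Tuple

# Hypothetical types for illustration: scalar states and actions.
Policy = Callable[[float], float]
ValueFn = Callable[[float, float], float]


def should_intervene(state: float, student_action: float, teacher_action: float,
                     value_fn: ValueFn, threshold: float) -> bool:
    """Assumed rule: intervene if the teacher's action is estimated to be
    better than the student's by more than `threshold`."""
    gap = value_fn(state, teacher_action) - value_fn(state, student_action)
    return gap > threshold


def shared_control_step(state: float, student_policy: Policy,
                        teacher_policy: Policy, value_fn: ValueFn,
                        threshold: float = 0.1) -> Tuple[float, bool]:
    """Return the action to execute and whether the teacher intervened."""
    student_action = student_policy(state)
    teacher_action = teacher_policy(state)
    if should_intervene(state, student_action, teacher_action, value_fn, threshold):
        return teacher_action, True
    return student_action, False


def toy_value(state: float, action: float) -> float:
    """Toy value estimate: the best action is 1.0 in every state."""
    return -(action - 1.0) ** 2


def imperfect_teacher(state: float) -> float:
    """A deliberately suboptimal teacher that always picks 0.8."""
    return 0.8
```

Under this rule a suboptimal teacher overrides a clearly worse student action (for example 0.0 in the toy setup) but leaves a student action of 0.9 alone, since the value estimate already rates it above the teacher's choice. This illustrates how a value-based gate could keep a weak teacher from limiting the student, which is the property the abstract claims for TS2C.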