Bayesian policy reuse (BPR) is a general policy transfer framework that selects a source policy from an offline library by inferring a belief over tasks from observation signals and a trained observation model. In this paper, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal, which carries limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and immediately available, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of a tabular observation model, which may be expensive or even infeasible to learn and maintain, especially when the state transition sample is used as the signal. Hence, we propose a scalable observation model that fits the state transition functions of the source tasks from only a small number of samples and generalizes to any signal observed in the target task. Moreover, we extend offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which avoids negative transfer when new, unknown tasks are encountered. Experimental results show that our method consistently facilitates faster and more efficient policy transfer.
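To make the idea of task inference from state transition samples concrete, the following is a minimal sketch, not the method of the paper itself. It assumes each source task has a fitted transition function and scores an observed transition with an isotropic Gaussian likelihood around the model's prediction. The function name `update_belief`, the noise scale `sigma`, the two linear dynamics models, and the small floor added to the belief are all hypothetical choices made for illustration.

```python
import numpy as np

def update_belief(belief, models, s, a, s_next, sigma=0.1):
    """One Bayesian belief update from a single transition (s, a, s_next).

    `models` is a list of fitted transition functions f(s, a) -> predicted s'.
    The Gaussian likelihood with std `sigma` is an assumption of this sketch.
    """
    # Log-likelihood of the observed next state under each source task's model.
    log_lik = np.array([
        -0.5 * np.sum((s_next - f(s, a)) ** 2) / sigma ** 2 for f in models
    ])
    # Small floor keeps the log finite once a task's belief underflows to zero.
    log_post = np.log(belief + 1e-12) + log_lik
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Two hypothetical source tasks with different (toy) dynamics.
models = [lambda s, a: s + a, lambda s, a: s - a]
belief = np.full(len(models), 1.0 / len(models))  # uniform prior over tasks

rng = np.random.default_rng(0)
s = np.zeros(2)
for _ in range(5):
    a = rng.uniform(-1.0, 1.0, size=2)
    # The target task here behaves like source task 0, plus small noise.
    s_next = s + a + rng.normal(0.0, 0.05, size=2)
    belief = update_belief(belief, models, s, a, s_next)
    s = s_next

# Index of the source policy that would be reused under the current belief.
best = int(np.argmax(belief))
```

Because the belief is updated after every transition rather than after every episode, the selected source policy can change within an episode, which is the practical benefit the abstract attributes to transition-level signals.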