Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning

Bayesian policy reuse (BPR) is a general policy transfer framework that selects a source policy from an offline library by inferring a belief over tasks from observation signals and a trained observation model. In this paper, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal, which carries limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and available at every step, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of a tabular observation model, which may be expensive or even infeasible to learn and maintain, especially when the state transition sample is used as the signal. Hence, we propose a scalable observation model that fits the state transition functions of the source tasks from only a small number of samples and can generalize to any signal observed in the target task. Moreover, we extend offline BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which can avoid negative transfer when new, unknown tasks are encountered. Experimental results show that our method consistently achieves faster and more efficient policy transfer.
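To make the idea of inferring the task belief from state transition samples concrete, the following is a minimal sketch, not the paper's implementation. It assumes each source task has a fitted transition function that predicts the next state, and that the observation model scores a sample (s, a, s') with an isotropic Gaussian centred on that prediction. The function names, the noise scale `sigma`, and the three toy one-dimensional source tasks are all hypothetical and chosen only for illustration.

```python
import math


def gaussian_log_likelihood(predicted, observed, sigma):
    """Log density of `observed` under an isotropic Gaussian centred on `predicted`.

    Assumption: this Gaussian form stands in for the observation model.
    """
    sq_err = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    dim = len(observed)
    return -0.5 * sq_err / sigma ** 2 - dim * math.log(sigma * math.sqrt(2 * math.pi))


def update_belief(belief, models, transition, sigma=0.5):
    """One Bayesian update of the task belief from a single (s, a, s') sample."""
    state, action, next_state = transition
    log_post = []
    for b, model in zip(belief, models):
        log_prior = math.log(b) if b > 0 else float("-inf")
        log_post.append(
            log_prior + gaussian_log_likelihood(model(state, action), next_state, sigma)
        )
    # Normalise in log space to avoid numerical underflow.
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def infer_task(belief, models, transitions, sigma=0.5):
    """Apply the belief update to each observed transition in turn."""
    for transition in transitions:
        belief = update_belief(belief, models, transition, sigma)
    return belief


# Hypothetical fitted transition functions for three toy source tasks.
source_models = [
    lambda s, a: [s[0] + a],      # task 0: action moves the state forward
    lambda s, a: [s[0] - a],      # task 1: action moves the state backward
    lambda s, a: [s[0] + 2 * a],  # task 2: action has a doubled effect
]

# Two transitions observed in a target task that behaves like task 2.
observed = [([0.0], 1.0, [2.0]), ([2.0], 0.5, [3.0])]

prior = [1 / 3, 1 / 3, 1 / 3]
posterior = infer_task(prior, source_models, observed)
most_likely_task = max(range(len(posterior)), key=posterior.__getitem__)
```

Because the belief is updated after every step rather than once per episode, the posterior here already concentrates on task 2 after two transitions. Reusing the policy of the most likely task is a simplification made for this sketch; a BPR-style selection rule could instead weigh each policy's expected performance under the current belief.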

