Sample efficiency is central to developing practical reinforcement learning (RL) for complex and large-scale decision-making problems. The ability to transfer and generalize knowledge gained from previous experience to downstream tasks can significantly improve sample efficiency. Recent research indicates that successor feature (SF) RL algorithms enable knowledge generalization between tasks with different rewards but identical transition dynamics. It has recently been hypothesized that combining model-based (MB) methods with SF algorithms can alleviate the limitation of fixed transition dynamics. Furthermore, uncertainty-aware exploration is widely recognized as another appealing approach for improving sample efficiency. Combining these two ideas, hybrid model-based successor features (MB-SF) and uncertainty-aware exploration, yields an approach to sample-efficient, uncertainty-aware knowledge transfer across tasks with different transition dynamics and/or reward functions. In this paper, the uncertainty of the value of each action is approximated by a Kalman filter (KF)-based multiple-model adaptive estimation. This KF-based framework treats the parameters of a model as random variables. To the best of our knowledge, this is the first attempt at formulating a hybrid MB-SF algorithm capable of generalizing knowledge across large or continuous state space tasks with different transition dynamics while requiring less computation at decision time than MB methods. We compare the number of samples required to learn the tasks against recent SF and MB baselines. The results show that our algorithm generalizes its knowledge across different transition dynamics, learns downstream tasks with significantly fewer samples than starting from scratch, and outperforms existing approaches.
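To make the idea of treating model parameters as random variables concrete, the following is a minimal sketch and not the paper's algorithm. It assumes the standard SF setting in which rewards are linear in features, r = φᵀw, so that an action value can be written as Q = ψᵀw for successor features ψ. A single Kalman filter (rather than the multiple-model adaptive estimation named above) tracks a Gaussian belief over w, and the value uncertainty is read off as ψᵀPψ. All function names, dimensions, and noise levels here are illustrative assumptions.

```python
import numpy as np

def kf_update(w_mean, w_cov, phi, reward, obs_noise_var):
    """One Kalman measurement update for the assumed model r = phi @ w + noise."""
    innovation = reward - phi @ w_mean                # prediction error
    innovation_var = phi @ w_cov @ phi + obs_noise_var
    gain = (w_cov @ phi) / innovation_var             # Kalman gain (vector)
    new_mean = w_mean + gain * innovation
    new_cov = w_cov - np.outer(gain, phi @ w_cov)     # P - K (phi^T P)
    return new_mean, new_cov

def q_value_with_uncertainty(psi, w_mean, w_cov):
    """Mean and variance of Q = psi @ w under the Gaussian belief over w."""
    return psi @ w_mean, psi @ w_cov @ psi

# Synthetic data: hypothetical "true" reward weights, used only for illustration.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -0.5, 2.0])

w_mean = np.zeros(3)          # prior mean over reward weights
w_cov = np.eye(3) * 10.0      # broad prior covariance

for _ in range(200):
    phi = rng.normal(size=3)                          # observed state-action features
    reward = phi @ true_w + rng.normal(scale=0.1)     # noisy reward observation
    w_mean, w_cov = kf_update(w_mean, w_cov, phi, reward, obs_noise_var=0.01)

# Hypothetical successor features for one state-action pair.
psi = np.array([0.5, 1.0, 0.2])
q_mean, q_var = q_value_with_uncertainty(psi, w_mean, w_cov)
print("Q mean:", q_mean, "Q variance:", q_var)
```

In a sketch like this, the variance term could drive exploration (for example, by favoring actions whose value estimates are still uncertain); how the paper actually uses the uncertainty is not specified in the abstract.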