CAT: Caution Aware Transfer in Reinforcement Learning via Distributional Risk

Transfer learning in reinforcement learning (RL) has become a key strategy for improving data efficiency on new, unseen tasks by reusing knowledge from previously learned tasks. It is especially useful in real-world deployment, where computational resources are constrained and agents must adapt rapidly to novel environments. However, current state-of-the-art methods often fail to ensure safety during transfer, particularly when unforeseen risks emerge at deployment. In this work, we address these limitations by introducing a Caution-Aware Transfer (CAT) framework. Unlike traditional approaches that restrict risk considerations to mean-variance criteria, we define "caution" as a more general and comprehensive notion of risk. Our core innovation is to optimize, during the transfer process, a weighted sum of reward return and caution, where caution is defined on state-action occupancy measures; this allows a rich representation of diverse risk factors. To the best of our knowledge, this is the first work to explore the optimization of such a generalized risk notion in transfer RL. Our contributions are threefold: (1) We propose the CAT framework, which evaluates source policies in the test environment and constructs a new policy that balances reward maximization and caution. (2) We derive theoretical sub-optimality bounds for our method, providing rigorous guarantees of its efficacy. (3) We empirically validate CAT, demonstrating that it consistently outperforms existing methods by delivering safer policies under varying risk conditions in the test tasks.
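To make the objective concrete, the following is a minimal tabular sketch of scoring source policies by a weighted combination of reward return and caution, both computed from the discounted state-action occupancy measure. It is not the CAT algorithm itself: the toy MDP, the names `occupancy_measure`, `select_source_policy`, and `risky_mass`, the weight `lam`, and the choice to treat caution as a penalty are all assumptions made for illustration, and the sketch only selects among source policies rather than constructing a new one.

```python
import numpy as np

def occupancy_measure(P, pi, mu0, gamma):
    """Normalized discounted state-action occupancy of policy pi.

    P: (S, A, S) transition tensor, pi: (S, A) policy, mu0: (S,) start distribution.
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    # State occupancy d solves d = (1 - gamma) * mu0 + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1.0 - gamma) * mu0)
    return d[:, None] * pi  # shape (S, A), entries sum to 1

def select_source_policy(P, r, caution, policies, mu0, gamma=0.9, lam=1.0):
    """Score each source policy by return - lam * caution(occupancy) (assumed form)."""
    scores = []
    for pi in policies:
        occ = occupancy_measure(P, pi, mu0, gamma)
        ret = float(np.sum(occ * r))  # normalized expected discounted reward
        scores.append(ret - lam * float(caution(occ)))
    return int(np.argmax(scores)), scores

# Hypothetical toy test task: state 0 = start, 1 = safe (absorbing), 2 = risky (absorbing).
S, A = 3, 2
P = np.zeros((S, A, S))
P[0, 0, 1] = 1.0   # action 0 from the start leads to the safe state
P[0, 1, 2] = 1.0   # action 1 from the start leads to the risky state
P[1, :, 1] = 1.0
P[2, :, 2] = 1.0
r = np.zeros((S, A))
r[1, :] = 1.0      # safe state: modest reward
r[2, :] = 2.0      # risky state: higher reward
mu0 = np.array([1.0, 0.0, 0.0])

pi_safe = np.tile([1.0, 0.0], (S, 1))   # always action 0
pi_risky = np.tile([0.0, 1.0], (S, 1))  # always action 1

def risky_mass(occ):
    """Example caution functional: occupancy mass placed on the risky state."""
    return occ[2].sum()

best_neutral, scores_neutral = select_source_policy(
    P, r, risky_mass, [pi_safe, pi_risky], mu0, lam=0.0)
best_cautious, scores_cautious = select_source_policy(
    P, r, risky_mass, [pi_safe, pi_risky], mu0, lam=2.0)
```

With `lam=0.0` the score reduces to the reward return and the risky policy is selected; with `lam=2.0` the caution term dominates and the safe policy is selected. Because `caution` is an arbitrary function of the occupancy measure, it can encode risk notions other than variance, which is the generality the abstract refers to.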
