Recently, reward-conditioned reinforcement learning (RCRL) has gained popularity due to its simplicity, flexibility, and off-policy nature. However, we show that current RCRL approaches are fundamentally limited and fail to address two critical challenges of RCRL: improving generalization on high reward-to-go (RTG) inputs, and avoiding out-of-distribution (OOD) RTG queries at test time. To address these challenges when training vanilla RCRL architectures, we propose Bayesian Reparameterized RCRL (BR-RCRL), a novel set of inductive biases for RCRL inspired by Bayes' theorem. BR-RCRL removes a core obstacle that prevents vanilla RCRL from generalizing on high RTG inputs: the model's tendency to treat different RTG inputs as independent values, which we term "RTG Independence". BR-RCRL also allows us to design an accompanying adaptive inference method, which maximizes total returns while avoiding the OOD queries that yield unpredictable behaviors in vanilla RCRL methods. We show that BR-RCRL achieves state-of-the-art performance on the Gym-Mujoco and Atari offline RL benchmarks, improving upon vanilla RCRL by up to 11%.