Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.
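The core idea above — using a learned critic to align a return-conditioned policy's actions with the specified target return — can be sketched as a toy training objective. Everything here (the quadratic stand-in critic, the 1-D state/action, the weight `lam`) is an illustrative assumption, not the paper's actual architecture or loss:

```python
# Minimal sketch of a critic-guided RCSL-style loss (illustrative only).

def critic(state, action):
    # Hypothetical learned value function Q(s, a): the expected return of
    # taking `action` in `state`. A fixed quadratic stands in for a trained
    # network; its maximum return of 1.0 is attained at action = 0.5 * state.
    return 1.0 - (action - 0.5 * state) ** 2

def cgdt_loss(pred_action, data_action, state, target_return, lam=0.5):
    # Behavior-cloning term: match the dataset action, as in standard
    # return-conditioned supervised learning (e.g., Decision Transformer).
    bc = (pred_action - data_action) ** 2
    # Critic-guided term: push the predicted action's *expected* return,
    # Q(s, a), toward the specified target return, rather than relying on
    # the single sampled return observed in the trajectory.
    guide = (critic(state, pred_action) - target_return) ** 2
    return bc + lam * guide

state, data_action, target = 1.0, 0.6, 1.0
# An action whose expected return equals the target incurs no guided penalty;
# a misaligned action is penalized even if it is close to the dataset action.
aligned = cgdt_loss(0.5, data_action, state, target)    # Q(1.0, 0.5) = 1.0
misaligned = cgdt_loss(0.9, data_action, state, target) # Q(1.0, 0.9) = 0.84
print(aligned < misaligned)  # → True
```

The guided term is what distinguishes this from plain trajectory modeling: under stochastic transitions, sampled returns in a single trajectory can differ from the expected return, and the critic supplies the expectation.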