We study offline off-dynamics reinforcement learning (RL), which aims to leverage data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly Decision Transformer (DT) type frameworks, which predict actions conditioned on a desired return and the full trajectory history. Previous works address the dynamics shift problem by augmenting the rewards of source-domain trajectories to match the optimal trajectory distribution in the target domain. However, this strategy cannot be directly applied to RCSL because of (1) the distinct form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented (REAG) method for DT-type frameworks, which augments the returns of source-domain trajectories by aligning their distribution with that of the target domain. We provide a theoretical analysis showing that the RCSL policy learned with REAG achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations, REAG$_\text{Dara}^{*}$ and REAG$_\text{MV}^{*}$. Thorough experiments on D4RL datasets with various DT-type baselines demonstrate that our method consistently improves the performance of DT-type frameworks in off-dynamics RL.
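To make the return-augmentation idea concrete, below is a minimal sketch, not the paper's exact procedure, of relabeling the return-to-go conditioning of source-domain trajectories by matching the mean and variance of trajectory returns to those of the target domain. The trajectory dictionary format, the function names, and the mean/variance alignment rule are illustrative assumptions.

```python
import numpy as np

def returns_to_go(rewards):
    """Return-to-go at every timestep: R_t = sum of rewards from t onward."""
    return np.flip(np.cumsum(np.flip(rewards)))

def align_source_returns(source_trajs, target_trajs):
    """Relabel the return-to-go conditioning of source-domain trajectories so
    that the distribution of trajectory returns matches the target domain.
    The alignment here is a simple mean/variance (affine) match; the REAG
    variants in the paper may use a different alignment rule."""
    src_R = np.array([t["rewards"].sum() for t in source_trajs])
    tgt_R = np.array([t["rewards"].sum() for t in target_trajs])
    src_mu, src_std = src_R.mean(), src_R.std() + 1e-8
    tgt_mu, tgt_std = tgt_R.mean(), tgt_R.std() + 1e-8

    augmented = []
    for traj, R in zip(source_trajs, src_R):
        # Affine map of the trajectory return onto the target-domain statistics.
        R_aligned = (R - src_mu) / src_std * tgt_std + tgt_mu
        rtg = returns_to_go(traj["rewards"])
        # Shift every return-to-go label so the trajectory-level return equals
        # its aligned value (one simple relabeling choice).
        augmented.append({**traj, "returns_to_go": rtg + (R_aligned - R)})
    return augmented
```

The relabeled trajectories can then be used to train a DT-type policy exactly as in the single-domain setting, with the aligned returns serving as the conditioning input.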