Adapting policies across domains with mismatched dynamics remains a critical challenge in reinforcement learning (RL). In this paper, we study cross-domain offline RL, where an offline dataset from a similar source domain is available to augment policy learning on a target domain dataset. Directly merging the two datasets may yield suboptimal performance due to the potential dynamics mismatch. Existing approaches typically mitigate this issue by filtering source domain transitions or modifying their rewards, which, however, can leave valuable source domain data underexploited. Instead, we propose to correct the source domain data so that it is consistent with the target domain. To that end, we leverage an inverse policy model and a reward model to correct the actions and rewards of source transitions, explicitly aligning them with the target dynamics. Since limited data may render these models inaccurate, we further employ a forward dynamics model to retain only those corrected samples that match the target dynamics better than the original transitions. Building on these components, we propose the Selective Transition Correction (STC) algorithm, which enables reliable use of source domain data for policy adaptation. Experiments on various environments with dynamics shifts demonstrate that STC outperforms existing baselines.
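To make the correct-then-select procedure concrete, the following Python sketch illustrates one plausible reading of the pipeline described above. The model names (`inverse_model`, `reward_model`, `forward_model`) and the prediction-error selection rule are illustrative assumptions, not the paper's exact formulation; it is a minimal sketch assuming all three models have already been fit on the target domain dataset.

```python
import numpy as np

# Hypothetical pre-trained models (all assumed fit on the target domain data):
#   inverse_model(s, s_next)    -> action believed to produce s -> s_next in the target domain
#   reward_model(s, a, s_next)  -> reward estimate under target dynamics
#   forward_model(s, a)         -> predicted next state under target dynamics

def selective_transition_correction(source_batch, inverse_model,
                                    reward_model, forward_model):
    """Sketch of an STC-style correction step: relabel each source
    transition's action and reward with target-domain models, then keep
    the corrected sample only if it matches the target dynamics better
    than the original transition. The concrete selection criterion used
    here (forward-model prediction-error comparison) is an assumption;
    the abstract does not spell it out."""
    corrected = []
    for (s, a, r, s_next) in source_batch:
        a_hat = inverse_model(s, s_next)        # corrected action
        r_hat = reward_model(s, a_hat, s_next)  # corrected reward
        # Forward-model prediction errors for the original vs. corrected action.
        err_orig = np.linalg.norm(forward_model(s, a) - s_next)
        err_corr = np.linalg.norm(forward_model(s, a_hat) - s_next)
        if err_corr < err_orig:                 # corrected sample aligns better
            corrected.append((s, a_hat, r_hat, s_next))
    # Retained samples would then be merged with the target dataset
    # for standard offline policy learning.
    return corrected
```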