Incorporating pre-collected offline data can substantially improve the sample efficiency of reinforcement learning (RL), but its benefits can break down when the transition dynamics in the offline dataset differ from those encountered online. Existing approaches typically mitigate this issue by penalizing or filtering offline transitions in regions with a large dynamics gap. However, their dynamics-gap estimators often rely on KL divergence or mutual information, which can be ill-defined when the offline and online dynamics have mismatched support. To address this challenge, we propose CompFlow, a principled framework built on the theoretical connection between flow matching and optimal transport. Specifically, we model the online dynamics as a conditional flow built upon the output distribution of a pretrained offline flow, rather than learning it directly from a Gaussian prior. This composite structure provides two advantages: (1) improved generalization when learning online dynamics under limited interaction data, and (2) a well-defined and stable estimate of the dynamics gap via the Wasserstein distance between offline and online transitions. Building on this dynamics-gap estimator, we further develop an optimistic active data collection strategy that prioritizes exploration in high-gap regions, and show theoretically that it reduces the performance gap to the optimal policy. Empirically, CompFlow consistently outperforms strong baselines across a range of RL benchmarks with shifted-dynamics data.
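To make the support-mismatch point concrete, the sketch below contrasts the Wasserstein-1 distance with KL divergence on next-state samples from shifted dynamics. This is a hypothetical illustration, not the paper's method: CompFlow derives the gap from the composite flow via its optimal-transport connection, whereas here we simply compare 1-D empirical samples with `scipy.stats.wasserstein_distance`. All names below are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def dynamics_gap(offline_next_states, online_next_states):
    """Toy dynamics-gap proxy: W1 distance between two empirical 1-D
    next-state sample sets.

    Unlike KL divergence, the Wasserstein distance remains finite and
    well-defined even when the two sample sets have disjoint support.
    """
    return wasserstein_distance(offline_next_states, online_next_states)

rng = np.random.default_rng(0)
offline = rng.normal(loc=0.0, scale=1.0, size=1000)
online = offline + 0.5  # online dynamics shifted by a constant 0.5

# Identical dynamics -> zero gap; a constant shift of c -> gap of c,
# because W1 between equally weighted empirical samples is the mean
# absolute displacement under the optimal (here: identity) coupling.
print(dynamics_gap(offline, offline))  # 0.0
print(dynamics_gap(offline, online))   # ~0.5
```

A KL-based estimator would instead diverge (or require density smoothing) whenever the offline and online samples occupy disjoint regions, which is exactly the failure mode the abstract highlights.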