We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT $Q$-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than that of the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.
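As a minimal illustration of the idea (notation introduced here for exposition only; the abstract itself does not fix symbols), consider a source transition $(s, a, r, s')$ generated under source dynamics $P^{\mathrm{src}}$. The naive target $r + \max_{a'} Q(s', a')$ is biased for the target task, whereas a retargeted reward together with a change-of-measure weight on the continuation value gives, in expectation, the target-task one-step Bellman backup:
\[
y^{\mathrm{RWT}}(s,a,s') \;=\; r^{\mathrm{tgt}}(s,a) \;+\; \rho(s,a,s')\,\max_{a'} Q(s',a'),
\qquad
\rho(s,a,s') \;=\; \frac{P^{\mathrm{tgt}}(s' \mid s,a)}{P^{\mathrm{src}}(s' \mid s,a)},
\]
so that
\[
\mathbb{E}_{s' \sim P^{\mathrm{src}}(\cdot \mid s,a)}\!\left[\, y^{\mathrm{RWT}}(s,a,s') \,\right]
\;=\;
r^{\mathrm{tgt}}(s,a) \;+\; \mathbb{E}_{s' \sim P^{\mathrm{tgt}}(\cdot \mid s,a)}\!\left[\, \max_{a'} Q(s',a') \,\right].
\]
This sketch conveys only the importance-weighting intuition behind the one-step correction, not the paper's full operator or its two-stage estimator.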