In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from the estimated model is biased. We introduce an operator shifting method for reducing the error introduced by the estimated model. When the error is measured in the residual norm, we prove that the shifting factor is always positive and upper bounded by $1+O\left(1/n\right)$, where $n$ is the number of samples used in learning each row of the transition matrix. We also propose a practical numerical algorithm for implementing the operator shifting.
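The bias claim can be checked numerically. The sketch below is only an illustration and does not implement the operator shifting method itself. It assumes a fixed policy, a known reward vector, a discount factor `gamma`, and a small hypothetical 3-state chain (`P_true`, `r`); none of these come from the paper. Each row of the transition matrix is estimated from $n$ multinomial samples, so the estimate is unbiased, and the value function is obtained by solving the Bellman equation $v = r + \gamma P v$.

```python
import numpy as np


def value_function(P, r, gamma):
    """Solve the Bellman equation v = r + gamma * P v for a fixed policy."""
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P, r)


def estimate_transition_matrix(P, n, rng):
    """Estimate each row of P from n sampled transitions, so E[P_hat] = P."""
    return np.vstack([rng.multinomial(n, row) / n for row in P])


def mean_estimated_value(P, r, gamma, n, trials, seed=0):
    """Average the value function over many independently estimated models."""
    rng = np.random.default_rng(seed)
    total = np.zeros(P.shape[0])
    for _ in range(trials):
        P_hat = estimate_transition_matrix(P, n, rng)
        total += value_function(P_hat, r, gamma)
    return total / trials


# Hypothetical 3-state chain, chosen only for illustration.
P_true = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.9],
])
r = np.array([1.0, 0.0, -1.0])
gamma = 0.9

v_true = value_function(P_true, r, gamma)
v_mean = mean_estimated_value(P_true, r, gamma, n=20, trials=2000)

print("true value function:      ", v_true)
print("mean of estimated values: ", v_mean)
print("empirical bias:           ", v_mean - v_true)
```

Because the map from $P$ to $(I - \gamma P)^{-1} r$ is nonlinear, the average of the estimated value functions generally differs from the true one even though the average of the estimated matrices does not. Rerunning with a larger `n` gives a rough sense of how the gap depends on the per-row sample count.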