Environment models play an important role in Model-Based Reinforcement Learning (MBRL): they aim to predict future states from past ones. Existing works usually ignore instantaneous dependence in the state; that is, they assume that the future state variables are conditionally independent of one another given the past states. However, instantaneous dependence is prevalent in many RL environments. For instance, in the stock market, instantaneous dependence can exist between two stocks because a fluctuation in one stock can quickly affect the other, and prices are recorded at a time resolution coarser than the timescale of that effect. In this paper, we prove that, with few exceptions, ignoring instantaneous dependence can result in suboptimal policy learning in MBRL. To address this suboptimality problem, we propose a simple plug-and-play method that enables existing MBRL algorithms to take instantaneous dependence into account. Through experiments on two benchmarks, we (1) confirm the existence of instantaneous dependence with visualization; (2) validate our theoretical finding that ignoring instantaneous dependence leads to suboptimal policies; and (3) verify that our method effectively enables reinforcement learning with instantaneous dependence and improves policy performance.
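To make the conditional-independence assumption concrete, the following is a minimal sketch on a toy linear-Gaussian environment. It is not the method proposed in the paper. The function names (`simulate`, `gaussian_avg_loglik`), the two-variable state, and the noise correlation `rho` are all assumptions made for illustration. The two noise terms of the next state are correlated within a single step, so a factorized (diagonal-covariance) model, which treats the next-state variables as conditionally independent, fits the transitions worse than a model with a full covariance.

```python
import numpy as np


def simulate(n=5000, rho=0.8, seed=0):
    """Toy transitions (hypothetical): s_next = 0.9 * s + noise,
    where the two noise components are correlated within the same step."""
    rng = np.random.default_rng(seed)
    s = rng.normal(size=(n, 2))
    cov = np.array([[1.0, rho], [rho, 1.0]])
    noise = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return s, 0.9 * s + noise


def gaussian_avg_loglik(resid, cov):
    """Average log-density of residuals under a zero-mean Gaussian N(0, cov)."""
    d = resid.shape[1]
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ni,ij,nj->n", resid, np.linalg.inv(cov), resid)
    return float(np.mean(-0.5 * (d * np.log(2 * np.pi) + logdet + maha)))


s, s_next = simulate()

# Both models share the same linear mean, fitted by least squares.
A, *_ = np.linalg.lstsq(s, s_next, rcond=None)
resid = s_next - s @ A

# Full covariance keeps the within-step correlation between the two variables.
full_cov = np.cov(resid, rowvar=False)
# Factorized model: drop off-diagonal terms, i.e. assume conditional independence.
diag_cov = np.diag(np.diag(full_cov))

ll_full = gaussian_avg_loglik(resid, full_cov)
ll_factorized = gaussian_avg_loglik(resid, diag_cov)
resid_corr = float(np.corrcoef(resid, rowvar=False)[0, 1])

print(f"residual correlation:       {resid_corr:.3f}")
print(f"avg log-lik, full cov:      {ll_full:.3f}")
print(f"avg log-lik, factorized:    {ll_factorized:.3f}")
```

In this toy setting the gap in average log-likelihood comes entirely from the ignored off-diagonal covariance term; how such a modelling gap affects the learned policy is the question the paper addresses theoretically.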