A fundamental challenge in reinforcement learning (RL) is that collecting sufficient data can be both time-consuming and expensive. In this paper, we formalize a concept of time reversal symmetry in a Markov decision process (MDP), which builds upon the established structure of dynamically reversible Markov chains (DRMCs) and time-reversibility in classical physics. Specifically, we investigate the utility of this concept in reducing the sample complexity of reinforcement learning. We observe that utilizing the structure of time reversal in an MDP allows every environment transition experienced by an agent to be transformed into a feasible reverse-time transition, effectively doubling the number of experiences in the environment. To test the usefulness of this newly synthesized data, we develop a novel approach called time symmetric data augmentation (TSDA) and investigate its application to both proprioceptive and pixel-based state representations in off-policy, model-free RL. Empirical evaluations show that these synthetic transitions can enhance the sample efficiency of RL agents in time-reversible scenarios without friction or contact. We also test this method in more realistic environments where these assumptions are not globally satisfied. In these environments, we find that TSDA can significantly degrade sample efficiency and policy performance, but can also improve sample efficiency under the right conditions. Ultimately, we conclude that time symmetry shows promise in enhancing the sample efficiency of reinforcement learning, and we provide guidance on when the environment and reward structures are of an appropriate form for TSDA to be employed effectively.
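The reverse-time transformation described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a proprioceptive state laid out as positions followed by velocities, a time-reversal map that negates the velocities (as in classical mechanics), an action that is left unchanged in the reversed transition, and a reward that is either copied or recomputed by a user-supplied function. The names `reverse_state`, `reverse_transition`, `augment`, `n_pos`, and `reward_fn` are all hypothetical.

```python
import numpy as np


def reverse_state(state, n_pos):
    """Assumed time-reversal map: keep the first n_pos entries (positions)
    and negate the remaining entries (velocities)."""
    flipped = np.asarray(state, dtype=float).copy()
    flipped[n_pos:] *= -1.0
    return flipped


def reverse_transition(s, a, r, s_next, n_pos, reward_fn=None):
    """Build a reverse-time transition from (s, a, r, s_next).

    The reversed transition starts at the time-reversed next state and ends
    at the time-reversed original state. Keeping the action unchanged and
    copying the reward are assumptions of this sketch, not claims about the
    paper's method; pass reward_fn to recompute the reward instead.
    """
    s_rev = reverse_state(s_next, n_pos)
    s_next_rev = reverse_state(s, n_pos)
    a_rev = np.asarray(a, dtype=float).copy()
    r_rev = r if reward_fn is None else reward_fn(s_rev, a_rev, s_next_rev)
    return s_rev, a_rev, r_rev, s_next_rev


def augment(batch, n_pos, reward_fn=None):
    """Return the batch with one synthetic reversed transition per real one,
    doubling the number of transitions."""
    out = []
    for transition in batch:
        out.append(transition)
        out.append(reverse_transition(*transition, n_pos=n_pos, reward_fn=reward_fn))
    return out
```

In an off-policy setting, a function like `augment` would be applied to transitions before they are stored in, or after they are sampled from, the replay buffer. Whether the reversed transitions are physically feasible depends on the environment: as the abstract notes, friction and contact break the symmetry, and the reward must be of a form that remains meaningful under reversal.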