In real-world scenarios, the application of reinforcement learning is significantly challenged by complex non-stationarity. Most existing methods attempt to model changes in the environment explicitly, which often requires impractical prior knowledge. In this paper, we propose a new perspective: non-stationarity can propagate and accumulate through complex causal relationships during state transitions, thereby compounding its complexity and affecting policy learning. We believe that this challenge can be addressed more effectively by tracing the causal origin of non-stationarity. To this end, we introduce the Causal-Origin REPresentation (COREP) algorithm. COREP primarily employs a guided updating mechanism to learn a stable graph representation of states, termed the causal-origin representation. By leveraging this representation, the learned policy exhibits strong resilience to non-stationarity. We supplement our approach with a theoretical analysis grounded in a causal interpretation of non-stationary reinforcement learning, which supports the validity of the causal-origin representation. Experimental results further demonstrate that COREP outperforms existing methods in tackling non-stationarity.