Sequential Action-Induced Invariant Representation for Reinforcement Learning

How to accurately learn task-relevant state representations from high-dimensional observations with visual distractions is a realistic and challenging problem in visual reinforcement learning. Recently, unsupervised representation learning methods based on bisimulation metrics, contrastive learning, prediction, and reconstruction have shown the ability to extract task-relevant information. However, prediction-, contrast-, and reconstruction-based approaches lack an appropriate mechanism for extracting task information, and bisimulation-based methods are limited in domains with sparse rewards, so it remains difficult to extend these methods effectively to environments with distractions. To alleviate these problems, in this paper we incorporate action sequences, which contain task-intensive signals, into representation learning. Specifically, we propose a Sequential Action-induced invariant Representation (SAR) method, in which the encoder is optimized by an auxiliary learner to preserve only the components that follow the control signals of sequential actions, so that the agent is induced to learn representations that are robust against distractions. We conduct extensive experiments on DeepMind Control suite tasks with distractions, where SAR achieves the best performance over strong baselines. We also demonstrate the effectiveness of our method at disregarding task-irrelevant information by deploying SAR to realistic CARLA-based autonomous driving with natural distractions. Finally, we analyze generalization through generalization decay and t-SNE visualization. Code and demo videos are available at https://github.com/DMU-XMU/SAR.git.
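The abstract does not state SAR's auxiliary loss, so the following is only a hedged illustration of the underlying intuition, not the authors' objective. It uses a generic multi-step regression as a stand-in: from the encodings of observations at times t and t+k, a linear head predicts the summed action sequence a_t, ..., a_{t+k-1}. The toy environment, `make_trajectories`, `sequence_action_loss`, `keep_controllable`, and `keep_distractor` are all hypothetical. Each observation has one coordinate driven by the actions and one distractor coordinate that ignores them. An encoder that keeps the action-driven coordinate drives the loss to zero, while one that keeps only the distractor cannot. Note that a loss of this kind rewards features that follow the control signals but does not by itself remove other features. SAR's auxiliary learner is described as doing the latter.

```python
import numpy as np


def make_trajectories(n, k, rng):
    """Hypothetical toy data: coordinate 0 integrates the actions,
    coordinate 1 is a distractor that evolves independently of them."""
    actions = rng.uniform(-1.0, 1.0, size=(n, k))  # a_t .. a_{t+k-1}
    pos0 = rng.normal(size=n)
    pos_k = pos0 + actions.sum(axis=1)             # action-driven state
    dist0 = rng.normal(size=n)
    dist_k = rng.normal(size=n)                    # distractor ignores actions
    obs0 = np.stack([pos0, dist0], axis=1)
    obs_k = np.stack([pos_k, dist_k], axis=1)
    return obs0, obs_k, actions


def sequence_action_loss(encode, obs0, obs_k, actions):
    """Hypothetical stand-in auxiliary loss: fit a least-squares linear head
    that maps (z_t, z_{t+k}) to the summed action sequence, return the MSE."""
    z0, zk = encode(obs0), encode(obs_k)
    bias = np.ones((len(z0), 1))
    feats = np.concatenate([z0, zk, bias], axis=1)
    target = actions.sum(axis=1)
    w, *_ = np.linalg.lstsq(feats, target, rcond=None)
    residual = feats @ w - target
    return float(np.mean(residual ** 2))


def keep_controllable(obs):
    """Encoder that keeps only the action-driven coordinate."""
    return obs[:, :1]


def keep_distractor(obs):
    """Encoder that keeps only the distractor coordinate."""
    return obs[:, 1:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs0, obs_k, actions = make_trajectories(2000, 4, rng)
    print("controllable:", sequence_action_loss(keep_controllable, obs0, obs_k, actions))
    print("distractor:  ", sequence_action_loss(keep_distractor, obs0, obs_k, actions))
```

In this toy setting the distractor-only encoder's loss stays near the variance of the summed actions (about k/3 for uniform actions on [-1, 1]), because nothing in its features depends on the actions.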
