It is important for deep reinforcement learning (DRL) algorithms to transfer their learned policies to new environments that have different visual inputs. In this paper, we introduce Prompt-based Proximal Policy Optimization ($P^{3}O$), a three-stage DRL algorithm that transfers visual representations from a target to a source environment by applying prompting. The three stages of $P^{3}O$ are pre-training, prompting, and predicting. In particular, we specify a prompt-transformer for representation conversion and propose a two-step training process that trains the prompt-transformer for the target environment while the rest of the DRL pipeline remains unchanged. We implement $P^{3}O$ and evaluate it on the OpenAI CarRacing video game. The experimental results show that $P^{3}O$ outperforms state-of-the-art visual transfer schemes. In particular, $P^{3}O$ allows the learned policies to perform well in environments with different visual inputs, and it is much more effective than retraining the policies in these environments.
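The pre-training, prompting, and predicting stages can be illustrated with a toy sketch. This is not the paper's implementation: the one-dimensional "observation", the brightness-inversion shift, the affine `PromptTransformer`, the paired regression loss, and every name below are assumptions chosen for illustration, and the single regression loop stands in for the two-step training process, whose details the abstract does not give. The one property it aims to show is that only the prompt-transformer is trained while the pre-trained policy stays frozen.

```python
import random


def frozen_policy(feature: float) -> float:
    """Stage 1 stand-in: a policy assumed to be pre-trained on the source
    environment. It is never updated after this point."""
    return 2.0 * feature - 1.0  # hypothetical steering command in [-1, 1]


def target_view(source_pixel: float) -> float:
    """Hypothetical visual shift: the target environment inverts brightness."""
    return 1.0 - source_pixel


class PromptTransformer:
    """Toy affine map from a target observation to a source-like one."""

    def __init__(self) -> None:
        self.scale = 1.0
        self.shift = 0.0

    def __call__(self, obs: float) -> float:
        return self.scale * obs + self.shift


def train_prompt(prompt: PromptTransformer, steps: int = 2000,
                 lr: float = 0.5, seed: int = 0) -> PromptTransformer:
    """Stage 2 stand-in: fit only the prompt-transformer with SGD on a
    squared error against paired source observations (an assumption)."""
    rng = random.Random(seed)
    for _ in range(steps):
        source_obs = rng.random()
        obs = target_view(source_obs)
        err = prompt(obs) - source_obs
        # Gradient of 0.5 * err**2 with respect to scale and shift.
        prompt.scale -= lr * err * obs
        prompt.shift -= lr * err
    return prompt


# Stage 3 (predicting): the frozen policy acts on converted target observations.
prompt = train_prompt(PromptTransformer())
action = frozen_policy(prompt(target_view(0.8)))
```

After training, `action` should closely match `frozen_policy(0.8)`, the command the policy would issue for the corresponding source observation, even though the policy itself was never retrained.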