Deep reinforcement learning (DRL) algorithms need to transfer their learned policies to new environments that have different visual inputs. In this paper, we introduce Prompt-based Proximal Policy Optimization ($P^{3}O$), a three-stage DRL algorithm that transfers visual representations from a target environment to a source environment by applying prompting. The three stages are pre-training, prompting, and predicting. In particular, we specify a prompt-transformer for representation conversion and propose a two-step training process that trains the prompt-transformer for the target environment while the rest of the DRL pipeline remains unchanged. We implement $P^{3}O$ and evaluate it on the OpenAI CarRacing video game. The experimental results show that $P^{3}O$ outperforms state-of-the-art visual transfer schemes. In particular, $P^{3}O$ allows the learned policies to perform well in environments with different visual inputs, which is much more effective than retraining the policies in these environments.