Drawing on the intuition that aligning different modalities in a shared semantic embedding space helps models understand states and actions more easily, we propose a new perspective on the offline reinforcement learning (RL) challenge. Concretely, we recast offline RL as a supervised learning task by integrating multimodal and pre-trained language models. Our approach incorporates state information derived from images and action-related data obtained from text, thereby improving RL training performance and promoting long-term strategic thinking. We emphasize the contextual understanding of language and demonstrate how decision-making in RL can benefit from aligning the representations of states and actions with language representations. Our method significantly outperforms current baselines in evaluations on Atari and OpenAI Gym environments. This advances offline RL performance and efficiency while providing a novel perspective on offline RL. Our code and data are available at https://github.com/Zheng0428/MORE_.