Recent advances in deep reinforcement learning have showcased its potential for tackling complex tasks. However, experiments on visual control tasks have revealed that state-of-the-art reinforcement learning models struggle with out-of-distribution generalization. In contrast, higher-level concepts and global context are relatively easy to express in language. Building on the recent success of large language models, our main objective is to improve state abstraction in reinforcement learning by leveraging language for robust action selection. Specifically, we focus on learning language-grounded visual features to enhance world model learning, a model-based reinforcement learning technique. To enforce our hypothesis explicitly, we mask out the bounding boxes of a few objects in the image observation and provide text prompts that describe these masked objects. We then predict the masked objects, along with the surrounding regions, as a pixel reconstruction task, similar to the transformer-based masked autoencoder approach. Our proposed LanGWM (Language Grounded World Model) achieves state-of-the-art performance in out-of-distribution tests on the 100K-interaction-step benchmark of iGibson point navigation tasks. Furthermore, because the extracted visual features are language grounded, our technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
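The masking step described above can be illustrated with a minimal sketch. It is not the LanGWM implementation: the function names `mask_objects` and `reconstruction_loss`, the prompt format, the zero fill value, and the mean-squared-error loss are all assumptions chosen for illustration, and the encoder and decoder that would consume these inputs are omitted.

```python
import numpy as np


def mask_objects(image, boxes, labels, mask_value=0.0):
    """Blank out object bounding boxes and describe them in a text prompt.

    image:  (H, W, C) float array holding one image observation
    boxes:  list of (x_min, y_min, x_max, y_max) pixel boxes, max exclusive
    labels: list of object names, one per box

    Hypothetical helper: the prompt wording and fill value are assumptions.
    """
    masked = image.copy()
    mask = np.zeros(image.shape[:2], dtype=bool)
    for x0, y0, x1, y1 in boxes:
        masked[y0:y1, x0:x1, :] = mask_value  # hide the object's pixels
        mask[y0:y1, x0:x1] = True             # remember which pixels were hidden
    prompt = "masked objects: " + ", ".join(labels)
    return masked, mask, prompt


def reconstruction_loss(prediction, target):
    """Mean squared pixel error over the whole image.

    The loss covers the masked objects and their surroundings, matching the
    abstract's description; MSE itself is an assumed choice.
    """
    return float(np.mean((prediction - target) ** 2))


if __name__ == "__main__":
    observation = np.ones((4, 4, 3))
    masked, mask, prompt = mask_objects(observation, [(1, 1, 3, 3)], ["chair"])
    print(prompt)
    print(int(mask.sum()), "pixels masked")
    # Using the masked image as a stand-in "prediction" shows the loss a model
    # would incur if it failed to fill in the hidden object.
    print(reconstruction_loss(masked, observation))
```

In a full pipeline, the masked image and the prompt would be passed to a visual encoder and a language encoder respectively, and a decoder would be trained to reproduce the original observation.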