In many reinforcement learning (RL) scenarios, observations from more than one sensor modality are available. For example, many agents can perceive their internal state via proprioceptive sensors but must infer the environment's state from high-dimensional observations such as images. For image-based RL, a variety of self-supervised representation learning approaches exist to improve performance and sample efficiency, but these approaches learn the image representation in isolation. Including proprioception can help representation learning algorithms focus on relevant aspects and guide them toward better representations. Hence, in this work, we propose using Recurrent State Space Models to fuse all available sensory information into a single consistent representation. We combine reconstruction-based and contrastive approaches for training, which allows using the most appropriate method for each sensor modality. For example, we can use reconstruction for proprioception and a contrastive loss for images. We demonstrate the benefits of utilizing proprioception in learning representations for RL in a large set of experiments. Furthermore, we show that our joint representations significantly improve performance compared to a post hoc combination of image representations and proprioception.
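To make the idea of choosing a loss per modality concrete, the following is a minimal sketch, not the authors' implementation. It assumes a mean squared error as the reconstruction loss for proprioception and an InfoNCE-style objective as the contrastive loss for images; the abstract names neither choice. All function names and the `contrastive_weight` parameter are hypothetical, and the recurrent state space model itself (dynamics, encoders, and any regularization of the latent state) is omitted.

```python
import math


def mse_reconstruction_loss(predicted, target):
    """Mean squared error between a decoded and an observed proprioception vector."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def info_nce_loss(latent_embeddings, image_embeddings, temperature=1.0):
    """InfoNCE-style loss over a batch (assumed contrastive objective).

    Latent embedding i should score highest against image embedding i;
    the other images in the batch act as negatives.
    """
    total = 0.0
    for i, z in enumerate(latent_embeddings):
        # Dot-product similarity between latent i and every image in the batch.
        logits = [
            sum(a * b for a, b in zip(z, x)) / temperature
            for x in image_embeddings
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-softmax probability of the matching pair.
        total += log_norm - logits[i]
    return total / len(latent_embeddings)


def joint_loss(proprio_pred, proprio_obs, latent_emb, image_emb,
               contrastive_weight=1.0):
    """Sum of a reconstruction term (proprioception) and a contrastive term (images)."""
    recon = sum(
        mse_reconstruction_loss(p, t) for p, t in zip(proprio_pred, proprio_obs)
    ) / len(proprio_obs)
    contrast = info_nce_loss(latent_emb, image_emb)
    return recon + contrastive_weight * contrast
```

In this sketch both terms would be computed from the same latent state, so gradients from the proprioceptive reconstruction and from the image contrastive term shape one shared representation rather than two separate ones.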