One of the most critical aspects of multimodal Reinforcement Learning (RL) is the effective integration of different observation modalities. Deriving robust and accurate representations from these modalities is key to improving the robustness and sample efficiency of RL algorithms. However, learning representations for visuotactile data in RL settings poses significant challenges, particularly due to the high dimensionality of the data and the complexity of correlating visual and tactile inputs with the dynamic environment and task objectives. To address these challenges, we propose Multimodal Contrastive Unsupervised Reinforcement Learning (M2CURL). Our approach employs a novel multimodal self-supervised learning technique that learns efficient representations and contributes to faster convergence of RL algorithms. Our method is agnostic to the underlying RL algorithm and can therefore be integrated with any available RL algorithm. We evaluate M2CURL on the Tactile Gym 2 simulator and show that it significantly enhances learning efficiency across different manipulation tasks. This is evidenced by faster convergence rates and higher cumulative rewards per episode, compared to standard RL algorithms without our representation learning approach.
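The abstract does not specify the exact form of the multimodal contrastive objective, so the following is only a minimal sketch of one plausible instantiation: an InfoNCE-style inter-modal loss that pulls together visual and tactile embeddings of the same observation and pushes apart mismatched pairs. The function name, temperature value, and encoder outputs (`z_vis`, `z_tac`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def intermodal_info_nce(z_vis, z_tac, temperature=0.1):
    """Hypothetical InfoNCE loss between visual and tactile embeddings.

    z_vis, z_tac: (batch, dim) outputs of separate visual and tactile
    encoders for the same batch of time-aligned observations. Matching
    indices are treated as positive pairs; all other pairings serve as
    negatives within the batch.
    """
    z_vis = F.normalize(z_vis, dim=-1)
    z_tac = F.normalize(z_tac, dim=-1)
    logits = z_vis @ z_tac.t() / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(z_vis.size(0), device=z_vis.device)
    # Symmetrize over both retrieval directions (vision -> touch, touch -> vision).
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Because such a loss depends only on the encoder outputs and not on the policy update, it can be added as an auxiliary objective alongside any standard RL algorithm, which is consistent with the algorithm-agnostic claim above.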