Learning self-supervised representations using reconstruction or contrastive losses improves performance and sample complexity of image-based and multimodal reinforcement learning (RL). However, different self-supervised loss functions have distinct advantages and limitations that depend on the information density of the underlying sensor modality. Reconstruction provides a strong learning signal but is susceptible to distractions and spurious information. Contrastive approaches can ignore such distractions, but they may fail to capture all relevant details and can lead to representation collapse. For multimodal RL, this suggests that different modalities should be treated differently based on the amount of distraction in the signal. We propose Contrastive Reconstructive Aggregated representation Learning (CoRAL), a unified framework that lets us choose the most appropriate self-supervised loss for each sensor modality and allows the representation to better focus on relevant aspects. We evaluate CoRAL's benefits on a wide range of tasks with images containing distractions or occlusions, a new locomotion suite, and a challenging manipulation suite with visually realistic distractions. Our results show that learning a multimodal representation by combining contrastive and reconstruction-based losses can significantly improve performance and solve tasks that are out of reach for more naive representation learning approaches and other recent baselines.
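The idea of choosing a self-supervised loss per sensor modality can be illustrated with a small sketch. This is not CoRAL's implementation: the function names, the use of mean squared error for reconstruction, the InfoNCE form of the contrastive loss, the temperature, and the dictionary-based configuration are all assumptions made for illustration. Here a distraction-heavy modality (images) gets a contrastive loss, while a clean, low-dimensional one (proprioception) gets a reconstruction loss.

```python
import numpy as np


def reconstruction_loss(decoded, target):
    """Mean squared error between a decoded observation and the true one."""
    return float(np.mean((decoded - target) ** 2))


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE (an assumed choice of contrastive loss): row i of `anchors`
    should match row i of `positives`; other rows in the batch are negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def combined_loss(outputs, targets, loss_per_modality):
    """Sum one self-supervised loss per modality, selected by name.

    For "reconstruction", `outputs` holds decoded observations.
    For "contrastive", `outputs` holds embeddings predicted from the
    representation and `targets` holds embeddings of the observations.
    """
    total = 0.0
    for modality, kind in loss_per_modality.items():
        if kind == "reconstruction":
            total += reconstruction_loss(outputs[modality], targets[modality])
        elif kind == "contrastive":
            total += info_nce_loss(outputs[modality], targets[modality])
        else:
            raise ValueError(f"unknown loss kind: {kind}")
    return total


# Hypothetical toy batch: 8 samples, 16-d image embeddings, 4-d proprioception.
rng = np.random.default_rng(0)
targets = {
    "image": rng.normal(size=(8, 16)),
    "proprio": rng.normal(size=(8, 4)),
}
outputs = {
    "image": targets["image"] + 0.05 * rng.normal(size=(8, 16)),
    "proprio": targets["proprio"] + 0.05 * rng.normal(size=(8, 4)),
}
loss = combined_loss(
    outputs, targets, {"image": "contrastive", "proprio": "reconstruction"}
)
print(f"combined self-supervised loss: {loss:.4f}")
```

Switching a modality between the two losses only requires changing its entry in the configuration dictionary, which mirrors the per-modality choice the abstract describes. In a real agent the resulting scalar would be minimized jointly with the RL objective by a differentiable framework rather than computed in NumPy.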