Learning policies that generalize to unseen environments is a fundamental challenge in visual reinforcement learning (RL). While most current methods focus on acquiring robust visual representations through auxiliary supervision, pre-training, or data augmentation, the potential of modern vision foundation models remains underleveraged. In this work, we introduce Segment Anything Model for Generalizable visual RL (SAM-G), a novel framework that leverages the promptable segmentation ability of the Segment Anything Model (SAM) to enhance the generalization capabilities of visual RL agents. We use image features from DINOv2 and SAM to find correspondences, which serve as point prompts to SAM; SAM then produces high-quality masked images that are fed directly to the agents. Evaluated across 8 DMControl tasks and 3 Adroit tasks, SAM-G significantly improves visual generalization while changing only the agents' observations, not their architecture. Notably, SAM-G achieves 44% and 29% relative improvements on the challenging video hard setting on DMControl and Adroit, respectively, compared to state-of-the-art methods. Video and code: https://yanjieze.com/SAM-G/
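The correspondence step described in the abstract (matching image features to obtain point prompts, then masking the observation) can be illustrated with a minimal sketch. This is not the SAM-G implementation: the function names `correspondence_points` and `apply_mask`, the `patch_size` value, the use of cosine similarity with an argmax, and the zeroing of pixels outside the mask are all assumptions made for illustration. Random arrays stand in for DINOv2/SAM features, and no real SAM call is made.

```python
import numpy as np


def correspondence_points(feature_map, reference_features, patch_size=14):
    """Return one (x, y) pixel point prompt per reference feature.

    feature_map: (H, W, C) array of per-patch image features (stand-in
        for features from a vision backbone).
    reference_features: (K, C) array of features taken at target locations.
    patch_size: hypothetical patch stride in pixels, used to map a patch
        index to the pixel at the center of that patch.
    """
    h, w, c = feature_map.shape
    flat = feature_map.reshape(h * w, c)
    # L2-normalize so that the dot product equals cosine similarity.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    refs = reference_features / (
        np.linalg.norm(reference_features, axis=1, keepdims=True) + 1e-8
    )
    similarity = refs @ flat.T          # shape (K, H*W)
    best = similarity.argmax(axis=1)    # most similar patch per reference
    rows, cols = np.divmod(best, w)
    xs = cols * patch_size + patch_size // 2
    ys = rows * patch_size + patch_size // 2
    return np.stack([xs, ys], axis=1)   # shape (K, 2), (x, y) order


def apply_mask(image, mask):
    """Zero out pixels outside a boolean (H, W) mask (assumed masking rule)."""
    return image * mask[..., None]


# Synthetic demo: plant a reference feature at patch (row=2, col=3).
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 4, 8))
reference = feature_map[2, 3][None, :]
points = correspondence_points(feature_map, reference)

# Synthetic masking demo: keep a single pixel of an all-ones image.
image = np.ones((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True
masked = apply_mask(image, mask)
```

In a real pipeline the resulting points would be passed to a promptable segmenter as foreground prompts, and the returned mask would replace the placeholder `mask` above. Because only the observation is modified, the agent's network can stay unchanged, which matches the design choice the abstract emphasizes.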