SegDAC: Improving Visual Reinforcement Learning by Extracting Dynamic Object-Centric Representations from Pretrained Vision Models

Visual reinforcement learning (RL) is challenging because the agent must extract useful representations from high-dimensional inputs while learning effective control from sparse and noisy rewards. Although large perception models exist, integrating them effectively into RL for visual generalization and improved sample efficiency remains difficult. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground the image segmentation process via text inputs. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and learns which segments to focus on using online RL, without human labels. We evaluate SegDAC on a challenging visual generalization benchmark built on ManiSkill3, which covers diverse manipulation tasks under strong visual perturbations. SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting, and matches or surpasses prior methods in sample efficiency across all evaluated tasks. Project Page: https://segdac.github.io/

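The abstract does not specify how the transformer handles a varying number of segments, so the following is only an illustrative sketch of one common technique: pad the segment embeddings to a fixed length and mask the padded slots inside a scaled dot-product attention pool. The function name `attention_pool`, the `max_segments` parameter, and the fixed `query` vector are hypothetical and are not taken from SegDAC; in a trained model the query would be learned.

```python
import math
from typing import List, Sequence


def attention_pool(
    query: Sequence[float],
    segments: List[Sequence[float]],
    max_segments: int,
) -> List[float]:
    """Pool a variable number of segment embeddings into one fixed-size vector.

    Hypothetical helper for illustration only; not SegDAC's actual architecture.
    """
    if not segments:
        raise ValueError("at least one segment embedding is required")
    if len(segments) > max_segments:
        raise ValueError("more segments than max_segments")

    dim = len(query)
    n_pad = max_segments - len(segments)

    # Pad with zero vectors so every time step has the same shape,
    # and record which slots hold real segments.
    padded = [list(s) for s in segments] + [[0.0] * dim for _ in range(n_pad)]
    mask = [True] * len(segments) + [False] * n_pad

    # Scaled dot-product scores; padded slots get -inf so they
    # receive exactly zero weight after the softmax.
    scores = [
        sum(q * k for q, k in zip(query, seg)) / math.sqrt(dim)
        if is_real
        else float("-inf")
        for seg, is_real in zip(padded, mask)
    ]

    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of segment embeddings.
    return [
        sum(w * seg[i] for w, seg in zip(weights, padded)) for i in range(dim)
    ]
```

Because masked slots contribute zero weight, the pooled vector depends only on the real segments, so the same downstream actor and critic heads can consume frames with two segments or twenty.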