Visual reinforcement learning policies trained on pixel observations often struggle to generalize when visual conditions change at test time. Object-centric representations are a promising alternative, but most approaches use fixed-size slot representations, require image reconstruction, or need auxiliary losses to learn object decompositions. As a result, it remains unclear how to learn RL policies directly from object-level inputs without these constraints. We propose SegDAC, a Segmentation-Driven Actor-Critic that operates on a variable-length set of object token embeddings. At each timestep, text-grounded segmentation produces object masks, from which spatially aware token embeddings are extracted. A transformer-based actor-critic processes these dynamic tokens, using segment positional encoding to preserve spatial information across objects. We ablate these design choices and show that segment positional encoding and variable-length processing are each necessary for strong performance. We evaluate SegDAC on 8 ManiSkill3 manipulation tasks under 12 visual perturbation types across 3 difficulty levels. SegDAC improves over prior visual generalization methods by 15% on the easy, 66% on the medium, and 88% on the hardest settings. It matches the sample efficiency of state-of-the-art visual RL methods while generalizing better under visual changes. Project Page: https://segdac.github.io/
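To make the token-processing pipeline concrete, below is a minimal PyTorch sketch of a transformer actor-critic over a variable-length set of object token embeddings. Everything here is an illustrative assumption based on the abstract, not the authors' implementation: the class name `SegmentActorCritic`, the choice of encoding each mask's normalized bounding box as the segment positional encoding, the learned pooling token, and the head dimensions are all hypothetical.

```python
import torch
import torch.nn as nn


class SegmentActorCritic(nn.Module):
    """Sketch of a transformer actor-critic over variable-length object tokens.

    Assumptions (not from the paper): segment positional encoding is a linear
    projection of each mask's normalized (cx, cy, w, h) box, and a learned
    [CLS]-style token pools the set for the actor and critic heads.
    """

    def __init__(self, token_dim=256, n_heads=4, n_layers=3, action_dim=7):
        super().__init__()
        # Hypothetical segment positional encoding: box -> embedding space.
        self.seg_pos = nn.Linear(4, token_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learned pooling token prepended to the object-token set.
        self.cls = nn.Parameter(torch.zeros(1, 1, token_dim))
        self.actor = nn.Linear(token_dim, action_dim)  # e.g., action mean
        self.critic = nn.Linear(token_dim, 1)          # state value

    def forward(self, tokens, boxes, pad_mask):
        # tokens:   (B, N, D) object token embeddings, zero-padded to N slots
        # boxes:    (B, N, 4) normalized mask boxes for positional encoding
        # pad_mask: (B, N) bool, True where a slot is padding (no object)
        x = tokens + self.seg_pos(boxes)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)
        # The pooling token is never masked; padded object slots are ignored,
        # which is how a fixed-width batch carries a variable-length set.
        pad = torch.cat([pad_mask.new_zeros(x.size(0), 1), pad_mask], dim=1)
        h = self.encoder(x, src_key_padding_mask=pad)[:, 0]
        return self.actor(h), self.critic(h)


# Usage with a batch of 2 frames holding 5 and 3 segmented objects:
model = SegmentActorCritic()
tokens = torch.randn(2, 5, 256)
boxes = torch.rand(2, 5, 4)
pad_mask = torch.tensor([[False] * 5,
                         [False, False, False, True, True]])
action, value = model(tokens, boxes, pad_mask)
```

The key-padding mask is what lets one transformer handle a different number of object tokens per observation without a fixed-size slot bottleneck, and the box-derived positional encoding stands in for the spatial information that cropping each segment would otherwise discard.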