Spotlighting Task-Relevant Features: Object-Centric Representations for Better Generalization in Robotic Manipulation

The generalization capabilities of robotic manipulation policies are heavily influenced by the choice of visual representations. Existing approaches typically rely on representations extracted from pre-trained encoders, using two dominant types of features: global features, which summarize an entire image in a single pooled vector, and dense features, which preserve patch-wise embeddings from the final encoder layer. While widely used, both feature types mix task-relevant and task-irrelevant information, leading to poor generalization under distribution shifts such as changes in lighting or texture, or the presence of distractors. In this work, we explore an intermediate, structured alternative: Slot-Based Object-Centric Representations (SBOCR), which group dense features into a finite set of object-like entities. This representation naturally reduces the noise passed to the robotic manipulation policy while retaining enough information to perform the task efficiently. We benchmark a range of global and dense representations against intermediate slot-based representations across a suite of simulated and real-world manipulation tasks ranging from simple to complex, and evaluate their generalization under diverse visual conditions, including changes in lighting and texture and the presence of distractors. Our findings reveal that SBOCR-based policies outperform policies based on dense and global representations in generalization settings, even without task-specific pretraining. These insights suggest that SBOCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.
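
To make the three representation types concrete, the following is a minimal NumPy sketch of how global, dense, and slot-based features differ in shape and construction. It is an illustration only, not the models evaluated in this work: the patch grid (196 patches of dimension 64), the function names, and the slot count are hypothetical, the "encoder output" is random data, and the grouping step is a simplified slot-attention-style update with no learned projections or recurrent update.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_feature(patches):
    """Global feature: mean-pool patch embeddings (N, D) into one vector (D,)."""
    return patches.mean(axis=0)


def slot_grouping(patches, num_slots=4, iters=3, seed=0):
    """Simplified slot-attention-style grouping (assumption: no learned weights).

    Returns slots of shape (K, D) and the patch-to-slot attention (N, K)
    computed in the final iteration.
    """
    rng = np.random.default_rng(seed)
    n, d = patches.shape
    slots = rng.normal(size=(num_slots, d))  # random slot initialization
    for _ in range(iters):
        logits = patches @ slots.T / np.sqrt(d)  # (N, K) similarity scores
        attn = softmax(logits, axis=1)  # softmax over slots: slots compete per patch
        # Normalize over patches so each slot becomes a weighted mean of its patches.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ patches  # (K, D)
    return slots, attn


# Hypothetical dense features: a 14x14 patch grid with 64-dimensional embeddings.
dense = np.random.default_rng(0).normal(size=(196, 64))
g = global_feature(dense)
slots, attn = slot_grouping(dense)

print(g.shape)      # (64,)     one vector for the whole image
print(dense.shape)  # (196, 64) one vector per patch
print(slots.shape)  # (4, 64)   one vector per object-like entity
```

In this sketch the policy input shrinks from 196 patch vectors to 4 slot vectors while remaining more structured than the single pooled vector, which is the intermediate position the abstract describes.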
