Unsupervised object-centric representation (OCR) learning has recently drawn attention as a new paradigm of visual representation, owing to its potential as an effective pre-training technique for various downstream tasks in terms of sample efficiency, systematic generalization, and reasoning. Although image-based reinforcement learning (RL) is one of the most important and most frequently mentioned of these downstream tasks, the benefit of OCR in RL has, surprisingly, not been investigated systematically so far. Instead, most evaluations have focused on rather indirect metrics such as segmentation quality and object property prediction accuracy. In this paper, we empirically investigate the effectiveness of OCR pre-training for image-based RL. For systematic evaluation, we introduce a simple object-centric visual RL benchmark and conduct experiments to answer questions such as ``Does OCR pre-training improve performance on object-centric tasks?'' and ``Can OCR pre-training help with out-of-distribution generalization?''. Our results provide empirical evidence on the effectiveness of OCR pre-training for RL and on the potential limitations of its use in certain scenarios. We also examine critical aspects of incorporating OCR pre-training in RL, including performance in a visually complex environment and the appropriate pooling layer for aggregating the object representations.