OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning

Humans possess the cognitive ability to comprehend scenes in a compositional manner. To empower AI systems with similar capabilities, object-centric learning aims to acquire representations of individual objects from visual scenes without any supervision. Although recent object-centric learning methods have made remarkable progress on complex synthetic datasets, applying them to complex real-world scenes remains a major challenge. One essential reason is the scarcity of real-world datasets specifically tailored to object-centric learning. To address this problem, we propose OCTScenes, a versatile real-world dataset of tabletop scenes for object-centric learning, which is meticulously designed to serve as a benchmark for comparing, evaluating, and analyzing object-centric learning methods. OCTScenes contains 5000 tabletop scenes with a total of 15 objects. Each scene is captured in 60 frames covering a 360-degree perspective. Consequently, OCTScenes can serve simultaneously as a benchmark for single-image, video, and multi-view object-centric learning methods. We conduct extensive experiments with representative object-centric learning methods on OCTScenes. The results demonstrate the shortcomings of state-of-the-art methods in learning meaningful representations from real-world data, despite their impressive performance on complex synthetic datasets. Furthermore, OCTScenes can serve as a catalyst for advancing existing methods, inspiring them to adapt to real-world scenes. The dataset and code are available at https://huggingface.co/datasets/Yinxuan/OCTScenes.
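
To make the three evaluation settings concrete, the following is a minimal sketch of how the 60 frames of one scene could be grouped into single-image samples, video clips, and multi-view sets. Only the scene count (5000) and the frames per scene (60) come from the abstract. The names `FrameRef`, `single_images`, `video_clips`, and `multi_view_set` are hypothetical, and the even angular spacing of frames is an assumption; none of this reflects the dataset's actual file layout or loading code.

```python
from dataclasses import dataclass

NUM_SCENES = 5000        # stated in the abstract
FRAMES_PER_SCENE = 60    # stated in the abstract


@dataclass(frozen=True)
class FrameRef:
    """Hypothetical reference to one frame of one scene."""
    scene: int
    frame: int

    @property
    def azimuth_deg(self) -> float:
        # Assumption: frames are evenly spaced over the full 360 degrees.
        return self.frame * 360.0 / FRAMES_PER_SCENE


def single_images(scene: int) -> list[FrameRef]:
    """Single-image setting: every frame is an independent sample."""
    return [FrameRef(scene, f) for f in range(FRAMES_PER_SCENE)]


def video_clips(scene: int, clip_len: int) -> list[list[FrameRef]]:
    """Video setting: consecutive, non-overlapping clips of clip_len frames."""
    if not 1 <= clip_len <= FRAMES_PER_SCENE:
        raise ValueError("clip_len must be between 1 and FRAMES_PER_SCENE")
    frames = single_images(scene)
    last_start = FRAMES_PER_SCENE - clip_len
    return [frames[i:i + clip_len] for i in range(0, last_start + 1, clip_len)]


def multi_view_set(scene: int, num_views: int) -> list[FrameRef]:
    """Multi-view setting: num_views frames spread evenly by frame index."""
    if not 1 <= num_views <= FRAMES_PER_SCENE:
        raise ValueError("num_views must be between 1 and FRAMES_PER_SCENE")
    step = FRAMES_PER_SCENE // num_views
    return [FrameRef(scene, i * step) for i in range(num_views)]


# Total number of frames implied by the figures in the abstract.
total_frames = NUM_SCENES * FRAMES_PER_SCENE
```

Under these assumptions, `video_clips(0, 6)` splits scene 0 into ten 6-frame clips, and `multi_view_set(0, 4)` picks four views a quarter turn apart.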
