We present a novel framework for 3D object-centric representation learning. Our approach decomposes complex scenes into individual objects from a single image in an unsupervised fashion. The method, called slot-guided Volumetric Object Radiance Fields (sVORF), composes volumetric object radiance fields with object slots as guidance to achieve unsupervised 3D scene decomposition. Specifically, sVORF obtains object slots from a single image via a transformer module, maps these slots to volumetric object radiance fields with a hypernetwork, and composes the object radiance fields at each 3D location under the guidance of the object slots. Moreover, sVORF significantly reduces memory requirements because only small sets of pixels are rendered during training. We demonstrate the effectiveness of our approach with top results on scene decomposition and generation tasks on complex synthetic datasets (e.g., Room-Diverse). We also confirm the potential of sVORF to segment objects in real-world scenes (e.g., the LLFF dataset). We hope our approach can provide a preliminary understanding of the physical world and ease future research in 3D object-centric representation learning.
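As a rough illustration of the pipeline the abstract describes (object slots, a hypernetwork that turns each slot into a small object radiance field, and slot-guided composition at a 3D location), the following NumPy sketch may help. It is not the authors' implementation: the transformer encoder is replaced by random slot vectors, the hypernetwork is a single linear map, and the composition rule (a softmax over slot-point affinities that weights each object's density) is an assumption made here for illustration only. All names (`hypernetwork`, `object_field`, `compose`, `W_hyper`, `W_query`) are hypothetical.

```python
import numpy as np

HIDDEN = 16  # hidden width of each per-object MLP (assumed value)


def hypernetwork(slot, W_hyper, hidden=HIDDEN):
    """Map one slot vector to the weights of a tiny MLP (3 -> hidden -> 4).

    Assumption: a single linear layer stands in for the hypernetwork.
    """
    flat = W_hyper @ slot                      # (3*hidden + hidden*4,)
    n1 = 3 * hidden
    W1 = flat[:n1].reshape(3, hidden)
    W2 = flat[n1:].reshape(hidden, 4)
    return W1, W2


def object_field(points, W1, W2):
    """Evaluate one object's radiance field at 3D points of shape (N, 3)."""
    h = np.tanh(points @ W1)                   # (N, hidden)
    out = h @ W2                               # (N, 4)
    density = np.logaddexp(0.0, out[:, 0])     # softplus keeps density positive
    rgb = 1.0 / (1.0 + np.exp(-out[:, 1:]))    # sigmoid keeps colour in (0, 1)
    return density, rgb


def compose(points, slots, W_hyper, W_query, hidden=HIDDEN):
    """Compose K object fields at each point, guided by the slots.

    Assumption: guidance weights are a softmax over slot-point affinities.
    """
    fields = [object_field(points, *hypernetwork(s, W_hyper, hidden)) for s in slots]
    dens = np.stack([d for d, _ in fields])    # (K, N)
    rgbs = np.stack([c for _, c in fields])    # (K, N, 3)

    logits = slots @ (points @ W_query).T      # (K, N) slot-point affinities
    logits = logits - logits.max(axis=0, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=0, keepdims=True)       # weights sum to 1 over slots

    weighted = w * dens                        # per-object contribution, (K, N)
    density = weighted.sum(axis=0)             # composite density, (N,)
    rgb = (weighted[..., None] * rgbs).sum(axis=0) / (density[:, None] + 1e-8)
    return density, rgb, w


# Toy usage: 4 random "slots" of dimension 8 stand in for the transformer output.
rng = np.random.default_rng(0)
K, D, N = 4, 8, 5
slots = rng.normal(size=(K, D))
W_hyper = rng.normal(size=(3 * HIDDEN + HIDDEN * 4, D)) * 0.1
W_query = rng.normal(size=(3, D))
points = rng.uniform(-1.0, 1.0, size=(N, 3))

density, rgb, w = compose(points, slots, W_hyper, W_query)
print(density.shape, rgb.shape, w.shape)
```

The per-slot weights `w` also give a soft assignment of each 3D point to an object, which is one plausible way a composition of this kind could yield a scene decomposition; the actual mechanism used by sVORF is not specified in the abstract.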