We present a 3D shape completion method that recovers the complete geometry of multiple objects in complex scenes from a single RGB-D image. Despite notable advances in single-object 3D shape completion, high-quality reconstruction in highly cluttered real-world multi-object scenes remains a challenge. To address this issue, we propose OctMAE, an architecture that leverages an Octree U-Net and a latent 3D MAE to achieve high-quality, near real-time multi-object shape completion through both local and global geometric reasoning. Because a naive 3D MAE can be computationally intractable and memory intensive even in the latent space, we introduce a novel occlusion masking strategy and adopt 3D rotary embeddings, which together significantly improve runtime and shape completion quality. To generalize to a wide range of objects in diverse scenes, we create a large-scale photorealistic dataset featuring a diverse set of 12K 3D object models from the Objaverse dataset, rendered in multi-object scenes with physics-based positioning. Our method outperforms the current state of the art on both synthetic and real-world datasets and demonstrates strong zero-shot capability.
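As an illustration of the 3D rotary embeddings mentioned above, the following is a minimal sketch and not the implementation used in OctMAE, whose exact formulation is not given here. It assumes the common axis-wise extension of rotary position embeddings: the feature dimension is split into three equal chunks, and each chunk is rotated by angles proportional to the x, y, or z coordinate. The function names `rope_1d` and `rope_3d`, the frequency base, and the feature layout are all hypothetical choices made for this example.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """Rotate feature pairs of x (N, d) by angles proportional to pos (N,).

    Assumed layout: feature j in the first half is paired with feature j
    in the second half, and each pair gets its own rotation frequency.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # (half,) per-pair frequencies
    angles = pos[:, None] * freqs[None, :]      # (N, half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to every (x1, x2) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def rope_3d(x, coords, base=10000.0):
    """Apply axis-wise rotary embeddings to x (N, D) using coords (N, 3).

    D must be divisible by 6: three chunks (one per axis), each of even size.
    """
    n, d = x.shape
    assert coords.shape == (n, 3), "coords must have shape (N, 3)"
    assert d % 6 == 0, "feature dimension must be divisible by 6"
    chunk = d // 3
    parts = [
        rope_1d(x[:, i * chunk:(i + 1) * chunk],
                coords[:, i].astype(np.float64), base)
        for i in range(3)
    ]
    return np.concatenate(parts, axis=-1)


# Usage: attention logits between rotated queries and keys.
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 12))
k = rng.standard_normal((5, 12))
coords = rng.integers(0, 16, size=(5, 3))       # hypothetical voxel indices
scores = rope_3d(q, coords) @ rope_3d(k, coords).T

# Translating every token by the same offset should leave the logits
# unchanged, since they depend only on relative 3D offsets.
shifted = coords + np.array([3, -2, 7])
scores_shifted = rope_3d(q, shifted) @ rope_3d(k, shifted).T
```

Under these assumptions, the attention logits depend only on the relative offset between two tokens along each axis, which is the property that makes rotary embeddings attractive for sparse 3D tokens at arbitrary positions.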