Segmenting humans in 3D indoor scenes has become increasingly important with the rise of human-centered robotics and AR/VR applications. To this end, we propose the task of joint 3D human semantic segmentation, instance segmentation, and multi-human body-part segmentation. Few works have attempted to directly segment humans in cluttered 3D scenes, largely because of the lack of annotated training data of humans interacting with 3D scenes. We address this challenge and propose a framework for generating training data of synthetic humans interacting with real 3D scenes. Furthermore, we propose a novel transformer-based model, Human3D, which is the first end-to-end model for segmenting multiple human instances and their body parts in a unified manner. The key advantage of our synthetic data generation framework is its ability to generate diverse and realistic human-scene interactions with highly accurate ground truth. Our experiments show that pre-training on synthetic data improves performance on a wide variety of 3D human segmentation tasks. Finally, we demonstrate that Human3D outperforms even task-specific state-of-the-art 3D segmentation methods.