NCHO: Unsupervised Learning for Neural 3D Composition of Humans and Objects

Deep generative models have recently been extended to synthesizing 3D digital humans. However, previous approaches treat clothed humans as a single chunk of geometry without considering the compositionality of clothing and accessories. As a result, individual items cannot be naturally composed into novel identities, which limits the expressiveness and controllability of generative 3D avatars. While several methods attempt to address this by leveraging synthetic data, the interaction between humans and objects is not authentic due to the domain gap, and manual asset creation is difficult to scale to a wide variety of objects. In this work, we present a novel framework for learning a compositional generative model of humans and objects (backpacks, coats, scarves, and more) from real-world 3D scans. Our compositional model is interaction-aware, meaning that it incorporates both the spatial relationship between humans and objects and the mutual shape change caused by physical contact. The key challenge is that, since humans and objects are in contact, their 3D scans are merged into a single piece. To decompose them without manual annotations, we propose to leverage two sets of 3D scans of a single person, with and without objects. Our approach learns to decompose objects and naturally compose them back into a generative human model in an unsupervised manner. Despite our simple setup requiring only the capture of a single subject with objects, our experiments demonstrate the strong generalization of our model by enabling the natural composition of objects onto diverse identities in various poses and the composition of multiple objects, both of which are unseen in the training data.
