We present a novel method for generating human motion to populate 3D indoor scenes. It can be controlled with various combinations of conditioning signals, such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds. State-of-the-art methods are specialized to a single setting, require vast amounts of high-quality and diverse training data, or are unconditional models that do not integrate scene or other contextual information. As a consequence, they have limited applicability and rely on costly training data. To address these limitations, we propose a new method, dubbed Purposer, based on neural discrete representation learning. Our model can flexibly exploit the different types of information already present in open-access large-scale datasets such as AMASS. First, we encode unconditional human motion into a discrete latent space. Second, an autoregressive generative model, trained for next-step prediction in this space and conditioned on key contextual information through either prompting or additive tokens, synthesizes sequences of latent indices. Because such a model is causal, we further design a novel conditioning block that handles future conditioning information by using a network with two branches to compute separate stacks of features. In this manner, Purposer can generate realistic motion sequences in diverse test scenes. Through exhaustive evaluation, we demonstrate that our multi-contextual solution outperforms existing approaches specialized for specific contextual information, in terms of both quality and diversity. Our model is trained on short sequences, but a byproduct of supporting various conditioning signals is that, at test time, different combinations can be used to chain short sequences together and generate long motions within a context scene.
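The two stages described above can be illustrated with a toy sketch. It is not the Purposer implementation: the abstract does not specify the encoder, the codebook, or the generative network, so `quantize`, `sample_indices`, `toy_model`, the 2D features, and the `target_index` context field are all hypothetical stand-ins chosen for illustration. The first function maps continuous motion features to the indices of their nearest codebook entries (the discrete latent space); the second samples latent indices one step at a time from a conditional next-step distribution.

```python
import random


def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def sample_indices(next_step_probs, context, num_steps, rng):
    """Sample latent indices autoregressively, one step at a time.

    next_step_probs(context, prefix) is a hypothetical stand-in for a trained
    model: it returns a probability for each codebook index given the
    conditioning context and the indices generated so far.
    """
    prefix = []
    for _ in range(num_steps):
        probs = next_step_probs(context, prefix)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        prefix.append(token)
    return prefix


# Toy codebook of four 2D entries and three "motion frames" (assumed values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]
motion_indices = quantize(frames, codebook)


def toy_model(context, prefix):
    # Assumption: a fixed distribution biased toward the conditioned index.
    # A real model would also depend on `prefix`.
    weights = [1.0] * len(codebook)
    weights[context["target_index"]] += 4.0
    total = sum(weights)
    return [w / total for w in weights]


generated = sample_indices(
    toy_model, {"target_index": 3}, num_steps=8, rng=random.Random(0)
)
```

In a full system, the generated indices would be mapped back to codebook entries and decoded into poses. The sketch stops at the index sequence because the abstract gives no detail on the decoder.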