Recent work has explored a range of model families for human motion generation, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion-based models. Despite their differences, many methods rely on over-parameterized input features and auxiliary losses to improve empirical results. These strategies should not be strictly necessary for diffusion models to match the human motion distribution. We show that results on par with the state of the art in unconditional human motion generation are achievable with a score-based diffusion model using only careful feature-space normalization and analytically derived weightings for the standard L2 score-matching loss. The model generates both motion and shape directly, thereby avoiding slow post hoc shape recovery from joints. We build the method step by step, with a clear theoretical motivation for each component, and provide targeted ablations demonstrating the effectiveness of each proposed addition in isolation.