Text-to-motion generation is a crucial task in computer vision that generates a target 3D motion from a given text description. Existing annotated datasets are limited in scale, causing most methods to overfit to these small datasets and fail to generalize to open-domain motions. Some approaches attempt to solve the open-vocabulary motion generation problem by aligning to the CLIP space or adopting the pretrain-then-finetune paradigm. However, the limited scale of current annotated datasets only allows them to learn a mapping from a sub-text-space to a sub-motion-space, rather than the mapping between the full text space and the full motion space (full mapping), which is the key to attaining open-vocabulary motion generation. To this end, this paper proposes to leverage atomic motions (simple body-part motions over a short time period) as an intermediate representation and to employ two orderly coupled steps, i.e., Textual Decomposition and Sub-motion-space Scattering, to address the full mapping problem. For Textual Decomposition, we design a fine-grained description conversion algorithm and combine it with the generalization ability of a large language model to convert any given motion text into atomic texts. Sub-motion-space Scattering learns the compositional process from atomic motions to the target motion, so that the learned sub-motion-space is scattered to form the full-motion-space. For a given open-domain motion, it transforms extrapolation into interpolation and thereby significantly improves generalization. Our network, $DSO$-Net, combines textual $d$ecomposition and sub-motion-space $s$cattering to solve $o$pen-vocabulary motion generation. Extensive experiments demonstrate that DSO-Net achieves significant improvements over state-of-the-art methods on open-vocabulary motion generation. Code is available at https://vankouf.github.io/DSONet/.
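The Textual Decomposition step described above can be illustrated with a minimal sketch. The body-part list, prompt format, and helper names below are hypothetical assumptions for illustration, not the authors' actual conversion algorithm; the LLM reply is mocked.

```python
# Hypothetical sketch of Textual Decomposition: a motion text is converted
# into per-body-part "atomic texts" by prompting an LLM. The part list and
# the 'part: movement' reply format are illustrative assumptions only.

BODY_PARTS = ["head", "torso", "left arm", "right arm", "left leg", "right leg"]

def build_decomposition_prompt(motion_text: str) -> str:
    """Assemble an LLM prompt asking for one short atomic clause per body part."""
    parts = ", ".join(BODY_PARTS)
    return (
        f"Decompose the motion '{motion_text}' into atomic motions.\n"
        f"For each of [{parts}], give one short clause of the form "
        f"'<part>: <movement>' on its own line."
    )

def parse_atomic_texts(llm_reply: str) -> dict:
    """Parse 'part: movement' lines from a (mocked) LLM reply into atomic texts."""
    atomic = {}
    for line in llm_reply.splitlines():
        if ":" in line:
            part, movement = line.split(":", 1)
            if part.strip().lower() in BODY_PARTS:
                atomic[part.strip().lower()] = movement.strip()
    return atomic

# Example with a mocked LLM reply (no real model call):
reply = "left leg: steps forward\nright arm: swings back\ntorso: leans slightly"
print(parse_atomic_texts(reply))
```

The atomic texts produced this way would then condition the compositional generation step (Sub-motion-space Scattering).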