Text-to-motion generation is an emerging and challenging problem that aims to synthesize motion with the same semantics as the input text. However, due to the lack of diverse labeled training data, most approaches are either limited to specific types of text annotations or require online optimization to adapt to the input texts during inference, at the cost of efficiency and stability. In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner, which requires neither paired training data nor extra online optimization to adapt to unseen texts. Inspired by prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from a masked motion. During inference, instead of changing the motion generator, our method reformulates the input text into a masked motion that serves as the prompt for the motion generator to "reconstruct" the motion. In constructing the prompt, the unmasked poses are synthesized by a text-to-pose generator. To supervise the optimization of the text-to-pose generator, we propose the first text-pose alignment model for measuring the alignment between texts and 3D poses. To prevent the pose generator from overfitting to limited training texts, we further propose a novel wordless training mechanism that optimizes the text-to-pose generator without any training texts. Comprehensive experimental results show that our method achieves a significant improvement over the baseline methods. The code is available.
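The idea of a masked motion used as a prompt can be illustrated with a minimal sketch. This is not the released code: the function name `build_masked_motion_prompt`, the list-based pose representation, and the choice of which frames hold known poses are all assumptions made for illustration. The sketch only shows how a few poses (in our setting, those produced by a text-to-pose generator) could be placed in an otherwise masked sequence, which a pretrained motion generator would then be asked to complete.

```python
from typing import List, Optional, Sequence, Tuple

Pose = Sequence[float]  # hypothetical flat pose vector, e.g. joint parameters


def build_masked_motion_prompt(
    key_poses: Sequence[Pose],
    key_frames: Sequence[int],
    num_frames: int,
) -> Tuple[List[Optional[List[float]]], List[bool]]:
    """Place known poses at chosen frames and mark all other frames as masked.

    Returns (prompt, is_masked). prompt[t] is a pose or None when frame t is
    masked; is_masked[t] is True exactly where prompt[t] is None.
    """
    if len(key_poses) != len(key_frames):
        raise ValueError("expected one pose per key frame")
    if len(set(key_frames)) != len(key_frames):
        raise ValueError("key_frames must be distinct")

    prompt: List[Optional[List[float]]] = [None] * num_frames
    for pose, frame in zip(key_poses, key_frames):
        if not 0 <= frame < num_frames:
            raise ValueError(f"frame index {frame} out of range")
        prompt[frame] = list(pose)

    is_masked = [pose is None for pose in prompt]
    return prompt, is_masked


# Illustrative usage: two known poses at the first and last of 8 frames.
example_prompt, example_mask = build_masked_motion_prompt(
    key_poses=[[0.0, 0.1, 0.2], [1.0, 1.1, 1.2]],
    key_frames=[0, 7],
    num_frames=8,
)
```

In this sketch the frames left as `None` are the ones a motion generator would fill in. How the key frames are chosen and how masked frames are encoded for the generator are not specified here and would follow the actual model design.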