Text to Motion aims to generate human motions from texts. Existing settings rely on limited Action Texts that contain explicit action labels, which restricts flexibility and practicality in scenarios that are difficult to describe directly. This paper extends limited Action Texts to arbitrary ones. Scene Texts without explicit action labels can enhance the practicality of models in complex and diverse applications such as virtual human interaction, robot behavior generation, and film production, while also supporting the exploration of implicit behavior patterns. However, the newly introduced Scene Texts may correspond to multiple reasonable outputs, posing significant challenges to existing datasets, frameworks, and evaluation protocols. To address this practical issue, we first create a new dataset, HUMANML3D++, by extending the texts of HUMANML3D, the largest existing dataset. Secondly, we propose a simple yet effective framework that extracts action instructions from arbitrary texts and subsequently generates motions. Furthermore, we benchmark this new setting with multi-solution metrics to address the inadequacies of existing single-solution metrics. Extensive experiments indicate that Text to Motion in this realistic setting is challenging, fostering new research in this practical direction.