Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods rely primarily on sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations, which limits their controllability over individual body parts. In this work, we construct a high-quality motion dataset with atomic, temporally-aware part-level text annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets, which either provide synchronized part captions over fixed time segments or rely solely on global sequence labels, ours captures asynchronous, semantically distinct part movements at fine temporal resolution. Building on this dataset, we introduce FrankenMotion, a diffusion-based part-aware motion generation framework in which each body part is guided by its own temporally structured textual prompt. To our knowledge, this is the first work to provide atomic, temporally-aware part-level motion annotations together with a model that enables motion generation under both spatial (body-part) and temporal (atomic-action) control. Experiments demonstrate that FrankenMotion outperforms all baseline models adapted and retrained for our setting, and that it can compose motions unseen during training. Our code and dataset will be made publicly available upon publication.