Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.
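The two-stage framework mentioned above rests on a discrete motion representation: a VQ-VAE first maps continuous motion features to codebook indices, and a transformer then generates those indices from text. The sketch below is a minimal, generic illustration of the standard VQ-VAE quantization step (nearest-neighbor codebook lookup with a straight-through gradient estimator), not the paper's TCaS-VQ-VAE; the class name, codebook size, and tensor shapes are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Generic VQ-VAE quantization step: nearest-neighbor codebook lookup
    with a straight-through gradient estimator. A sketch of the standard
    technique, not the paper's TCaS-VQ-VAE; sizes are illustrative."""

    def __init__(self, num_codes: int = 512, code_dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.beta = beta  # commitment loss weight (hypothetical default)
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, code_dim) continuous motion features from the encoder
        flat = z_e.reshape(-1, z_e.shape[-1])           # (B*T, D)
        dist = torch.cdist(flat, self.codebook.weight)  # distances to all codes, (B*T, K)
        indices = dist.argmin(dim=-1)                   # discrete motion tokens, (B*T,)
        z_q = self.codebook(indices).view_as(z_e)       # quantized features, (B, T, D)

        # Standard VQ-VAE objective terms: codebook loss + commitment loss.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss
```

In the second stage, a transformer would be trained over the returned token indices conditioned on text; the temporal-consistency alignment and kinematic constraints described in the abstract are additions on top of this standard pipeline and are not shown here.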