TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation

Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic motion sequences that align with a text description. Current methods focus primarily on spatial-temporal modeling or on frequency-domain analysis in isolation, and lack a unified framework for joint optimization across the spatial, temporal, and frequency domains. This limitation prevents a model from leveraging information from all three domains simultaneously, leading to suboptimal generation quality. In addition, motion-irrelevant cues caused by noise are often entangled with the features that contribute positively to generation, which leads to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework that integrates spatial, temporal, and frequency-domain modeling with causal intervention. TriC-Motion includes three core modules for domain-specific modeling: Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. A Score-guided Tri-domain Fusion module then integrates the valuable information from the three domains, jointly preserving temporal consistency, spatial topology, motion trends, and dynamics. Moreover, a Causality-based Counterfactual Motion Disentangler exposes motion-irrelevant cues in order to eliminate noise, disentangling the real modeling contribution of each domain for better generation. Extensive experiments validate that TriC-Motion outperforms state-of-the-art methods, attaining an R@1 of 0.612 on the HumanML3D dataset. These results demonstrate its ability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Code is available at: https://caoyiyang1105.github.io/TriC-Motion/.
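
To make the idea of fusing three domain-specific feature streams concrete, the following is a minimal NumPy sketch, not the TriC-Motion implementation. Everything in it is an assumption made for illustration: the function names `low_frequency_component` and `score_guided_fusion`, the use of a truncated FFT as a stand-in for frequency analysis, the random linear map standing in for spatial modeling, and the softmax-over-domains scoring rule. The abstract does not specify any of these details.

```python
import numpy as np


def low_frequency_component(motion, keep_bins):
    """Hypothetical frequency branch: keep only the lowest `keep_bins`
    frequency bins along the time axis (a crude 'motion trend')."""
    spec = np.fft.rfft(motion, axis=0)      # (T // 2 + 1, C), complex
    spec[keep_bins:] = 0                    # discard higher frequencies
    return np.fft.irfft(spec, n=motion.shape[0], axis=0)  # back to (T, C)


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def score_guided_fusion(feats, score_weights):
    """Hypothetical fusion rule (an assumption, not the paper's module).

    feats:         (3, T, C) stacked temporal, spatial, frequency features
    score_weights: (C,) vector used to score each domain at each frame
    Returns the fused (T, C) features and the (3, T) per-domain weights.
    """
    scores = feats @ score_weights          # (3, T): one score per domain and frame
    weights = softmax(scores, axis=0)       # normalize across the three domains
    fused = (weights[..., None] * feats).sum(axis=0)
    return fused, weights


# Toy usage with random data; shapes and values are illustrative only.
rng = np.random.default_rng(0)
T, C = 16, 8
motion = rng.standard_normal((T, C))

temporal = motion                                   # placeholder temporal features
spatial = motion @ rng.standard_normal((C, C))      # placeholder spatial features
frequency = low_frequency_component(motion, keep_bins=3)

feats = np.stack([temporal, spatial, frequency])    # (3, T, C)
fused, weights = score_guided_fusion(feats, rng.standard_normal(C))
```

In this sketch the per-frame weights sum to one across the three domains, so the fused feature at each frame is a convex combination of the temporal, spatial, and frequency features. A learned scoring network would normally replace the fixed `score_weights` vector.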
