Despite significant progress in 4D generation, rig and motion, the core structural and dynamic components of animation, are typically modeled as separate problems. Existing pipelines rely on ground-truth skeletons and skinning weights for motion generation and treat auto-rigging as an independent process, which undermines scalability and interpretability. We present RigMo, a unified generative framework that jointly learns rig and motion directly from raw mesh sequences, without any human-provided rig annotations. RigMo encodes per-vertex deformations into two compact latent spaces: a rig latent that decodes into explicit Gaussian bones and skinning weights, and a motion latent that produces time-varying SE(3) transformations. Together, these outputs define an animatable mesh with explicit structure and coherent motion, enabling feed-forward rig and motion inference for deformable objects. Beyond unified rig-motion discovery, we introduce a Motion-DiT model operating in RigMo's latent space and demonstrate that these structure-aware latents naturally support downstream motion generation tasks. Experiments on DeformingThings4D, Objaverse-XL, and TrueBones show that RigMo learns smooth, interpretable, and physically plausible rigs, while achieving superior reconstruction and category-level generalization compared to existing auto-rigging and deformation baselines. RigMo establishes a new paradigm for unified, structure-aware, and scalable dynamic 3D modeling.
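To make the decoded quantities concrete, the sketch below (NumPy) illustrates how Gaussian bones, skinning weights, and per-bone SE(3) transforms could be combined to deform a mesh. The specific blending rule (linear-blend-style), the use of isotropic Gaussian densities as soft skinning weights, and all function and variable names are illustrative assumptions, not the paper's confirmed formulation.

```python
# Minimal sketch: Gaussian bones + SE(3) transforms -> deformed vertices.
# All names and the exact blending rule are assumptions for illustration only.
import numpy as np

def gaussian_bone_weights(verts, centers, scales):
    """Soft per-vertex weights from isotropic Gaussian bones (assumed form).

    verts:   (V, 3) rest-pose vertex positions
    centers: (B, 3) Gaussian bone centers
    scales:  (B,)   per-bone standard deviations
    returns: (V, B) weights normalized over bones
    """
    d2 = ((verts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (V, B)
    logits = -0.5 * d2 / (scales[None, :] ** 2)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))          # stable softmax
    return w / w.sum(axis=1, keepdims=True)

def deform(verts, weights, R, t):
    """Linear-blend-style deformation from per-bone SE(3) transforms for one frame.

    R: (B, 3, 3) per-bone rotations, t: (B, 3) per-bone translations.
    """
    posed = np.einsum('bij,vj->vbi', R, verts) + t[None, :, :]      # (V, B, 3)
    return (weights[:, :, None] * posed).sum(axis=1)                # (V, 3)

# Toy usage: 4 vertices, 2 bones; identity motion on bone 0, a small lift on bone 1.
verts = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]])
centers = np.array([[0., 0, 0], [1., 1, 0]])
scales = np.array([0.5, 0.5])
W = gaussian_bone_weights(verts, centers, scales)
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0., 0, 0], [0., 0, 0.2]])
print(deform(verts, W, R, t))
```

In this reading, the rig latent would decode the static quantities (bone centers, scales, and the resulting skinning weights), while the motion latent would supply the per-frame rotations and translations; repeating the deformation over frames yields the animated mesh sequence.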