ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

Kinematic rigs provide a structured interface for articulating 3D meshes, but they lack an inherent representation of the plausible manifold of joint configurations for a given asset. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters often leads to semantic or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feed-forward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce artist-authored 4D datasets, ViPS transfers generative video priors into a universal distribution over a given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce asset-specific validity without requiring manual regularizers. Our model learns a smooth, compact, and controllable pose space that supports diverse sampling, manifold projection for inverse kinematics, and temporally coherent trajectories for keyframing. Furthermore, the distilled 3D pose samples serve as precise semantic proxies for guiding video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely on video priors, matches the performance of state-of-the-art methods trained on synthetic artist-created 4D data in both plausibility and diversity. Most importantly, as a universal model, ViPS demonstrates robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.
