Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
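To make the factorization concrete, the following is a minimal sketch of the second stage only: turning predicted 3D joint positions into per-joint rotations for a given rig. It is not the paper's constraint-aware IK. It assumes a single unbranched joint chain, ignores joint limits and twist about the bone axis, and simply aligns each rest-pose bone offset with the predicted bone direction. The names `rotation_between` and `chain_rotations` are hypothetical and introduced here for illustration.

```python
import numpy as np


def rotation_between(a, b, eps=1e-8):
    """Return the 3x3 shortest-arc rotation matrix taking direction a to direction b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    c = float(np.dot(a, b))  # cosine of the angle between a and b
    if c < -1.0 + eps:
        # Antiparallel case: rotate by pi about any axis perpendicular to a.
        helper = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        axis = np.cross(a, helper)
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    v = np.cross(a, b)
    # Skew-symmetric cross-product matrix of v (Rodrigues formula).
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + (K @ K) / (1.0 + c)


def chain_rotations(rest_offsets, positions):
    """Recover local joint rotations for an unbranched chain (illustrative assumption).

    rest_offsets[i] is the rest-pose offset from joint i to joint i + 1,
    expressed in joint i's local frame. positions is a (J, 3) array of
    predicted world-space joint positions for one frame.
    """
    positions = np.asarray(positions, dtype=float)
    parent_world = np.eye(3)  # accumulated world rotation of the parent joint
    local = []
    for i, offset in enumerate(rest_offsets):
        target = positions[i + 1] - positions[i]
        # Express the predicted bone direction in the parent's frame,
        # then align the rest-pose offset with it.
        r_local = rotation_between(offset, parent_world.T @ target)
        local.append(r_local)
        parent_world = parent_world @ r_local
    return local
```

Applied per frame, the resulting local rotations could be converted to Euler angles and written into a rotation-based format such as BVH. A practical solver would additionally handle branching skeletons, bone-length mismatch between the predicted trajectories and the rig, and the asset-specific constraints the abstract refers to.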