We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.
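As an illustration of the first design, the following is a minimal sketch of how sparse local 2D cues might be derived from 3D human motion alone through augmented 3D-to-2D projection. It is not the AnyAct implementation. The function names, the pinhole camera model, the yaw and pitch ranges, the fixed camera depth, and the root-relative normalization (used here as a simple stand-in for separating local from global motion) are all assumptions made for this example.

```python
import numpy as np


def random_camera_rotation(rng, max_yaw_deg=180.0, max_pitch_deg=20.0):
    """Sample a camera rotation (hypothetical augmentation ranges).

    Yaw is applied about the y (up) axis, then pitch about the x axis.
    """
    yaw = np.deg2rad(rng.uniform(-max_yaw_deg, max_yaw_deg))
    pitch = np.deg2rad(rng.uniform(-max_pitch_deg, max_pitch_deg))
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    r_yaw = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    r_pitch = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    return r_pitch @ r_yaw


def project_to_local_2d(joints_3d, rotation, focal=1.0, depth=5.0, root=0):
    """Project 3D joints to root-relative 2D coordinates.

    joints_3d: array of shape (T, J, 3), frames x joints x xyz.
    depth: assumed camera distance; it must exceed the joint extent
           so that every joint stays in front of the camera.
    Returns an array of shape (T, J, 2).
    """
    # Rotate into the sampled camera frame, then push along +z.
    cam = joints_3d @ rotation.T
    cam = cam + np.array([0.0, 0.0, depth])
    # Pinhole perspective division.
    uv = focal * cam[..., :2] / cam[..., 2:3]
    # Subtract the projected root joint so only local motion remains.
    return uv - uv[:, root:root + 1, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder motion: 4 frames, 6 joints, coordinates in [-1, 1].
    motion = rng.uniform(-1.0, 1.0, size=(4, 6, 3))
    cues = project_to_local_2d(motion, random_camera_rotation(rng))
    print(cues.shape)
```

Each training pair in such a setup would consist of the projected 2D cues as the condition and the original 3D human motion as the target, which is why no non-human motion data would be needed for supervision under these assumptions.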