Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most previous works either neglect semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method that leverages vision-language models to extract and maintain meaningful motion semantics. We use a differentiable module to render 3D motions; the high-level motion semantics are then incorporated into the retargeting process by feeding the rendered images to the vision-language model and aligning the extracted semantic embeddings. To preserve both fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training followed by fine-tuning with semantics and geometry constraints. Experimental results show that the proposed method produces high-quality motion retargeting results while accurately preserving motion semantics. The project page is available at https://sites.google.com/view/smtnet.
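The embedding-alignment step and the fine-tuning objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the form of the loss, so everything here is an assumption made for illustration: the cosine-distance formulation, the weighted sum of the two constraint terms, and the names `semantic_alignment_loss`, `fine_tuning_loss`, `w_sem`, and `w_geo` are hypothetical and not taken from the paper. The embeddings stand in for whatever a vision-language model would return for rendered frames of the source and retargeted motions.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity of two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def semantic_alignment_loss(source_embeddings, target_embeddings):
    """Mean cosine distance between paired semantic embeddings.

    Assumption: one embedding per rendered image, paired between the
    source motion and the retargeted motion. The paper may use a
    different distance.
    """
    if len(source_embeddings) != len(target_embeddings):
        raise ValueError("source and target must have the same number of embeddings")
    if not source_embeddings:
        raise ValueError("at least one embedding pair is required")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(source_embeddings, target_embeddings)
    ]
    return sum(distances) / len(distances)


def fine_tuning_loss(semantic_loss, geometry_loss, w_sem=1.0, w_geo=1.0):
    """Hypothetical second-stage objective: weighted sum of the two constraints."""
    return w_sem * semantic_loss + w_geo * geometry_loss
```

In a real training loop these quantities would be computed on tensors in an automatic-differentiation framework, so that gradients flow from the embeddings back through the differentiable renderer to the retargeting network; the plain-Python version above only shows the arithmetic.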