Human motion transfer aims to transfer motion from a dynamic target person to a static source person for motion synthesis. Accurate matching between the source person and the target motion, for both large and subtle motion changes, is vital to the quality of the transferred motion. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perception to capture large and subtle motion matching, respectively. It consists of two ViT encoders that extract features from the inputs (i.e., a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person feature as Key and Value, and compute cross-attention maps to perform global feature matching. We further introduce a convolutional layer after the global cross-attention computation to improve local perception. This matching process is implemented in both the warping and generation branches to guide the motion transfer. During training, we propose a mutual learning loss that enables co-supervision between the warping and generation branches for better motion representations. Experiments show that Human MotionFormer achieves new state-of-the-art performance both qualitatively and quantitatively. Project page: \url{https://github.com/KumapowerLIU/Human-MotionFormer}
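To make the matching step concrete, the following is a minimal NumPy sketch of one decoder block as described above: target motion tokens act as Query, source person tokens act as Key and Value, and a convolution over the token grid follows the global cross-attention. It is an illustration under stated assumptions, not the released implementation. The function names, the single attention head, the random projection weights, the depthwise 3x3 kernel, and the residual sum are all hypothetical choices made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(motion_tokens, source_tokens, w_q, w_k, w_v):
    # Query comes from the target motion; Key and Value come from the source person.
    q = motion_tokens @ w_q  # (N, d)
    k = source_tokens @ w_k  # (M, d)
    v = source_tokens @ w_v  # (M, d)
    # Each motion token attends over every source token (global matching).
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    return attn @ v, attn


def depthwise_conv3x3(feature_map, kernel):
    # feature_map: (H, W, C); kernel: (3, 3, C).
    # Zero-padded, per-channel cross-correlation (the usual deep-learning "convolution").
    h, w, _ = feature_map.shape
    padded = np.pad(feature_map, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(feature_map)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out


def decoder_block(motion_tokens, source_tokens, grid_hw, params):
    # Assumed layout: global cross-attention first, then a local convolution
    # on the token grid, combined with a residual sum (an assumption).
    h, w = grid_hw
    matched, attn = cross_attention(
        motion_tokens, source_tokens,
        params["w_q"], params["w_k"], params["w_v"],
    )
    local = depthwise_conv3x3(matched.reshape(h, w, -1), params["conv"])
    return matched + local.reshape(h * w, -1), attn


# Toy inputs: a 4x4 grid of 8-dimensional tokens for each image (hypothetical sizes).
rng = np.random.default_rng(0)
h, w, dim = 4, 4, 8
params = {
    "w_q": rng.standard_normal((dim, dim)) * 0.1,
    "w_k": rng.standard_normal((dim, dim)) * 0.1,
    "w_v": rng.standard_normal((dim, dim)) * 0.1,
    "conv": rng.standard_normal((3, 3, dim)) * 0.1,
}
motion_tokens = rng.standard_normal((h * w, dim))
source_tokens = rng.standard_normal((h * w, dim))

out, attn = decoder_block(motion_tokens, source_tokens, (h, w), params)
print(out.shape, attn.shape)
```

Each row of `attn` is a distribution over source tokens for one motion token, which is the global matching; the 3x3 kernel then mixes only neighbouring positions, which is the local refinement. In the framework described above, this matching is applied in both the warping and generation branches.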