In this paper, we present QueryWarp, a novel framework for temporally coherent human motion video translation. Existing diffusion-based video editing approaches rely solely on key and value tokens to ensure temporal consistency, which sacrifices the preservation of local and structural regions. In contrast, we consider complementary query priors by constructing temporal correlations among the query tokens of different frames. First, we extract appearance flows from the source poses to capture continuous human foreground motion. Then, during the denoising process of the diffusion model, we use these appearance flows to warp the previous frame's query token, aligning it with the current frame's query. This query warping imposes explicit constraints on the outputs of the self-attention layers, effectively guaranteeing temporally coherent translation. We perform experiments on various human motion video translation tasks, and the results demonstrate that our QueryWarp framework surpasses state-of-the-art methods both qualitatively and quantitatively.
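The query-warping step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`warp_query`, `fuse_queries`, `self_attention`), the backward-flow convention, the nearest-neighbour sampling, and the blending weight `alpha` are all assumptions made for illustration, since the abstract does not state how the warped query is combined with the current one.

```python
import numpy as np


def warp_query(prev_q, flow):
    """Backward-warp the previous frame's query map with an appearance flow.

    prev_q: (H, W, C) query features of the previous frame.
    flow:   (H, W, 2) offsets (dx, dy); flow[y, x] points from a location in
            the current frame to its source location in the previous frame
            (an assumed convention).
    Nearest-neighbour sampling is used here for simplicity.
    """
    h, w, _ = prev_q.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_q[src_y, src_x]


def fuse_queries(cur_q, warped_q, alpha=0.5):
    """Blend the warped previous query into the current query.

    alpha is a hypothetical weight; the abstract does not specify the fusion.
    """
    return alpha * warped_q + (1.0 - alpha) * cur_q


def self_attention(q, k, v):
    """Scaled dot-product attention over flattened tokens of shape (N, C)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Toy example: the foreground moves one pixel to the right between frames.
rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
prev_q = rng.normal(size=(H, W, C))
cur_q = rng.normal(size=(H, W, C))
flow = np.zeros((H, W, 2))
flow[..., 0] = -1.0  # each current pixel samples from its left neighbour

warped_q = warp_query(prev_q, flow)
q = fuse_queries(cur_q, warped_q).reshape(H * W, C)
k = rng.normal(size=(H * W, C))
v = rng.normal(size=(H * W, C))
out = self_attention(q, k, v)  # attention output driven by the fused query
```

Because the fused query carries information from the previous frame at motion-aligned positions, the attention output for corresponding regions is pulled toward agreement across frames, which is the intuition behind the explicit constraint mentioned above.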