We introduce a method to generate temporally coherent human animation from a single image, a video, or random noise. This problem has typically been formulated as auto-regressive generation, i.e., regressing on past frames to decode future frames. However, such unidirectional generation is highly prone to motion drift over time, producing unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To validate this claim, we design a novel human animation framework using a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noise, with intermediate results cross-conditioned bidirectionally between consecutive frames. In the experiments, our method demonstrates strong performance with realistic temporal coherence compared to existing unidirectional approaches.
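The bidirectional cross-conditioning described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the trained denoising network is replaced by a hand-written averaging rule, there is no noise schedule, and every name here (`toy_denoise_step`, `bidirectional_denoise`, `temporal_roughness`, `strength`) is hypothetical. The sketch only shows the data flow: each frame starts from its own Gaussian noise sample, and at every refinement step its intermediate estimate is conditioned on the intermediate estimates of both the previous and the next frame.

```python
import numpy as np


def toy_denoise_step(x_t, cond_prev, cond_next, strength=0.5):
    """Hypothetical stand-in for a learned denoiser (assumption, not the paper's network).

    Pulls each frame's current estimate toward the mean of the intermediate
    estimates of its previous and next frames.
    """
    context = 0.5 * (cond_prev + cond_next)
    return (1.0 - strength) * x_t + strength * context


def temporal_roughness(frames):
    """Sum of squared differences between consecutive frames."""
    return float(np.sum(np.diff(frames, axis=0) ** 2))


def bidirectional_denoise(num_frames=8, dim=4, num_steps=50, seed=0):
    """Refine per-frame Gaussian noise with conditioning from both temporal directions."""
    rng = np.random.default_rng(seed)
    initial = rng.standard_normal((num_frames, dim))  # one noise sample per frame
    frames = initial.copy()
    for _ in range(num_steps):
        # Replicate the first and last frame so boundary frames also have two neighbours.
        padded = np.pad(frames, ((1, 1), (0, 0)), mode="edge")
        prev_ctx = padded[:-2]  # intermediate estimate of frame i-1
        next_ctx = padded[2:]   # intermediate estimate of frame i+1
        frames = toy_denoise_step(frames, prev_ctx, next_ctx)
    return initial, frames


if __name__ == "__main__":
    initial, refined = bidirectional_denoise()
    print("roughness before:", temporal_roughness(initial))
    print("roughness after: ", temporal_roughness(refined))
```

In this toy setting, the frame-to-frame differences shrink as the steps proceed, because every frame is updated from both of its neighbours at each step. A unidirectional variant would pass only `prev_ctx`, which is the setting the abstract associates with motion drift.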