We introduce a method for generating temporally coherent human animation from a single image, a video, or random noise. This problem has been formulated as auto-regressive generation, i.e., regressing past frames to decode future frames. However, such unidirectional generation is highly prone to motion drift over time, producing unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To validate this claim, we design a novel human animation framework based on a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noise, with intermediate results cross-conditioned bidirectionally between consecutive frames. In experiments, our method demonstrates strong performance compared to existing unidirectional approaches, with realistic temporal coherence.
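The idea of cross-conditioning intermediate results between consecutive frames can be illustrated with a toy sketch. This is not the network described above: `toy_denoiser`, `bidirectional_denoise`, `temporal_roughness`, and the `strength` parameter are hypothetical names introduced here, and a hand-written blend stands in for the learned denoiser. The sketch only shows the data flow, in which each frame's update at every step depends on the intermediate results of both its previous and its next frame.

```python
import numpy as np


def toy_denoiser(x, cond, strength=0.5):
    # Hypothetical stand-in for a learned denoising network (assumption):
    # it simply pulls the noisy frame toward its conditioning signal.
    return (1.0 - strength) * x + strength * cond


def bidirectional_denoise(frames, num_steps=10, strength=0.5):
    """Refine a (T, D) array of per-frame Gaussian noise.

    At every step, frame k is conditioned on the previous step's
    intermediate results for frame k-1 (forward direction) and
    frame k+1 (backward direction).
    """
    x = np.array(frames, dtype=float)  # copy, so the input is not modified
    num_frames = x.shape[0]
    for _ in range(num_steps):
        prev = x.copy()  # intermediate results shared across frames
        for k in range(num_frames):
            # Boundary frames fall back to their own intermediate result.
            fwd = prev[k - 1] if k > 0 else prev[k]
            bwd = prev[k + 1] if k < num_frames - 1 else prev[k]
            cond = 0.5 * (fwd + bwd)  # bidirectional conditioning signal
            x[k] = toy_denoiser(prev[k], cond, strength)
    return x


def temporal_roughness(x):
    # Mean squared difference between consecutive frames.
    return float(np.mean(np.sum(np.diff(x, axis=0) ** 2, axis=1)))


rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 4))  # 8 frames, 4 features each
result = bidirectional_denoise(noise)
print(temporal_roughness(noise), temporal_roughness(result))
```

In this toy setting the second printed value is smaller than the first, since every step averages each frame with both neighbors. A unidirectional variant would use only `fwd`, so information would flow from past frames to future frames but never back.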