3D photography renders a static image into a video with appealing 3D visual effects. Existing approaches typically first perform monocular depth estimation, then render the input frame into subsequent frames from various viewpoints, and finally use an inpainting model to fill the missing or occluded regions. The inpainting model plays a crucial role in rendering quality, but it is normally trained on out-of-domain data. To reduce the gap between training and inference, we propose a novel self-supervised diffusion model as the inpainting module. Given a single input image, we automatically construct a training pair consisting of the masked occluded image and the ground-truth image via random cycle-rendering. The constructed training samples are closely aligned with the test instances and require no data annotation. To make full use of the masked images, we design a Masked Enhanced Block (MEB), which can be easily plugged into the UNet to enhance the semantic conditions. Towards real-world animation, we present a novel task, out-animation, which extends the space and time of input objects. Extensive experiments on real datasets show that our method achieves results competitive with existing state-of-the-art (SOTA) methods.
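The idea of building a training pair by cycle-rendering can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simple horizontal-parallax camera model in which each pixel moves by `shift * disparity`, and the function names `forward_warp` and `make_training_pair`, as well as the parameters `shift`, `max_shift`, and `rng`, are hypothetical. The image is warped to a random novel view, warped back, and the pixels that do not survive the round trip (because they were occluded or left the frame) form the inpainting mask; the original image serves as ground truth.

```python
import numpy as np


def forward_warp(image, disparity, shift, valid=None):
    """Splat pixels horizontally by shift * disparity (assumed camera model).

    Nearer pixels (larger disparity) win when several land on one target.
    Returns the warped image, warped disparity, and a mask of filled pixels.
    """
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + shift * disparity).astype(np.int64)
    use = (xt >= 0) & (xt < w)
    if valid is not None:
        use &= valid  # never splat pixels that are already holes
    sy, sx, tx = ys[use], xs[use], xt[use]
    d = disparity[sy, sx]

    # Z-buffer: keep the largest disparity that reaches each target pixel.
    zbuf = np.full((h, w), -np.inf)
    np.maximum.at(zbuf, (sy, tx), d)
    win = d == zbuf[sy, tx]
    sy, sx, tx = sy[win], sx[win], tx[win]

    warped = np.zeros_like(image)
    warped_disp = np.zeros_like(disparity)
    filled = np.zeros((h, w), dtype=bool)
    warped[sy, tx] = image[sy, sx]
    warped_disp[sy, tx] = disparity[sy, sx]
    filled[sy, tx] = True
    return warped, warped_disp, filled


def make_training_pair(image, disparity, shift=None, rng=None, max_shift=8.0):
    """Cycle-render image -> novel view -> back, and derive a hole mask.

    Returns (masked_image, hole_mask, ground_truth). The random shift stands
    in for a random viewpoint; its range is an illustrative assumption.
    """
    if shift is None:
        rng = np.random.default_rng() if rng is None else rng
        shift = rng.uniform(-max_shift, max_shift)

    novel, novel_disp, novel_ok = forward_warp(image, disparity, shift)
    _, _, back_ok = forward_warp(novel, novel_disp, -shift, valid=novel_ok)

    hole_mask = ~back_ok           # pixels lost during the round trip
    masked = image.copy()
    masked[hole_mask] = 0          # blank out the regions to be inpainted
    return masked, hole_mask, image
```

Because the mask comes from the image's own depth structure, the holes follow occlusion boundaries of the kind an inpainting model would see at test time, which is the motivation the abstract gives for the self-supervised setup. A real pipeline would use a full 3D camera model and an estimated depth map in place of the toy disparity shift used here.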