In the field of media production, video editing techniques play a pivotal role. Recent approaches have had great success at novel-view image synthesis of static scenes, but adding temporal information introduces an extra layer of complexity. Previous models have focused on implicitly representing static and dynamic scenes using NeRF. These models achieve impressive results but are costly at both training and inference time, because they overfit an MLP to a single scene, describing it implicitly as a function of position. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. Trained only on unrelated scenes, it accurately reconstructs novel views using multi-view synthesis techniques and scene flow-field estimation. We show that existing state-of-the-art approaches from a range of fields cannot adequately solve this new task, and we demonstrate the efficacy of our solution. The resulting network improves quantitative results by 15% and produces significantly better visual results.