Deep learning is providing a wealth of new approaches to the long-standing problem of novel view synthesis, from Neural Radiance Field (NeRF) based approaches to end-to-end style architectures. Each approach offers specific strengths but also comes with specific limitations in its applicability. This work introduces ViewFusion, a state-of-the-art end-to-end generative approach to novel view synthesis with unparalleled flexibility. ViewFusion simultaneously applies a diffusion denoising step to any number of input views of a scene, then combines the noise gradients obtained for each view using an inferred pixel-weighting mask, ensuring that for each region of the target scene only the most informative input views are taken into account. Our approach resolves several limitations of previous approaches by (1) being trainable and generalizing across multiple scenes and object classes, (2) adaptively taking in a variable number of pose-free views at both train and test time, and (3) generating plausible views even in severely underdetermined conditions (thanks to its generative nature), all while generating views of quality on par with or even better than state-of-the-art methods. Limitations include not generating a 3D embedding of the scene, which results in relatively slow inference, and our method having been tested only on the relatively small NMR dataset. Code is available.
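The per-pixel weighting step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `softmax_over_views` and `fuse_noise` are hypothetical, the use of a softmax to normalize the weights across views is an assumption (the abstract only states that a pixel-weighting mask is inferred), and the per-view noise predictions and weight logits are taken as given rather than produced by a denoising network.

```python
import math


def softmax_over_views(weight_logits):
    """Normalize per-view weight maps so that, at each pixel, weights sum to 1.

    weight_logits: list of V maps, each a height x width nested list.
    ASSUMPTION: softmax normalization is illustrative, not taken from the paper.
    """
    n_views = len(weight_logits)
    height, width = len(weight_logits[0]), len(weight_logits[0][0])
    weights = [[[0.0] * width for _ in range(height)] for _ in range(n_views)]
    for i in range(height):
        for j in range(width):
            # Subtract the per-pixel maximum for numerical stability.
            peak = max(weight_logits[v][i][j] for v in range(n_views))
            exps = [math.exp(weight_logits[v][i][j] - peak) for v in range(n_views)]
            total = sum(exps)
            for v in range(n_views):
                weights[v][i][j] = exps[v] / total
    return weights


def fuse_noise(noise_preds, weight_logits):
    """Combine V per-view noise predictions into one map via per-pixel weights.

    Works for any number of views V >= 1, mirroring the variable-view setting.
    """
    weights = softmax_over_views(weight_logits)
    n_views = len(noise_preds)
    height, width = len(noise_preds[0]), len(noise_preds[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for v in range(n_views):
        for i in range(height):
            for j in range(width):
                fused[i][j] += weights[v][i][j] * noise_preds[v][i][j]
    return fused


# Toy example: two views of a 1 x 2 "image" (hypothetical values).
noise_preds = [[[1.0, 0.0]], [[3.0, 4.0]]]
weight_logits = [[[0.0, 10.0]], [[0.0, -10.0]]]
fused = fuse_noise(noise_preds, weight_logits)
```

In this toy input, both views receive equal weight at the first pixel, so the fused value is their average, while the second pixel is dominated by the first view because its logit is much larger. The same function accepts a single view or many, which is the property that allows the number of input views to vary.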