We present a one-shot method to infer and render a photorealistic 3D representation from a single unposed image (e.g., a face portrait) in real time. Given a single RGB input, our image encoder directly predicts a canonical triplane representation of a neural radiance field, which enables 3D-aware novel view synthesis via volume rendering. Our method runs at 24 fps on consumer hardware and produces higher-quality results than strong GAN-inversion baselines that require test-time optimization. We train the triplane encoder pipeline using only synthetic data, showing how to distill the knowledge of a pretrained 3D GAN into a feedforward encoder. Our technical contributions include a Vision Transformer-based triplane encoder, a camera data augmentation strategy, and a loss function designed for synthetic data training. We benchmark against state-of-the-art methods and demonstrate significant improvements in robustness and image quality in challenging real-world settings. We showcase results on portraits of faces (FFHQ) and cats (AFHQ), but in the future our algorithm could also be applied to other categories for which a 3D-aware image generator is available.
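To make the rendering path concrete, the following is a minimal sketch of how a triplane can be queried at a 3D point and how samples along a ray can be composited by volume rendering. It is not the authors' implementation: the function names (`sample_plane`, `query_triplane`, `composite`), the dictionary layout of the planes, the choice to sum the three plane features, and the use of bilinear interpolation are all assumptions made for illustration. The compositing step follows the standard radiance-field quadrature, and the decoder that would map features to density and color is omitted.

```python
import math


def sample_plane(plane, u, v):
    """Bilinearly sample a square grid of feature vectors at (u, v) in [-1, 1]^2.

    `plane[row][col]` is a list of channel values; the grid needs at least 2x2 cells.
    """
    res = len(plane)
    # Clamp to the plane's extent, then map [-1, 1] to pixel coordinates [0, res - 1].
    u = min(max(u, -1.0), 1.0)
    v = min(max(v, -1.0), 1.0)
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    x0 = min(int(math.floor(x)), res - 2)
    y0 = min(int(math.floor(y)), res - 2)
    tx, ty = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = (1 - tx) * plane[y0][x0][c] + tx * plane[y0][x0 + 1][c]
        bottom = (1 - tx) * plane[y0 + 1][x0][c] + tx * plane[y0 + 1][x0 + 1][c]
        out.append((1 - ty) * top + ty * bottom)
    return out


def query_triplane(planes, point):
    """Project a 3D point onto the XY, XZ, and YZ planes and sum the sampled features.

    Summation is an assumed aggregation rule, not something stated in the abstract.
    """
    x, y, z = point
    f_xy = sample_plane(planes["xy"], x, y)
    f_xz = sample_plane(planes["xz"], x, z)
    f_yz = sample_plane(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


def composite(sigmas, colors, deltas):
    """Alpha-composite samples along one ray, ordered front to back.

    sigmas: densities per sample; colors: RGB triples; deltas: segment lengths.
    Returns the pixel color and the accumulated opacity.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha            # light remaining behind it
    return pixel, 1.0 - transmittance
```

In a full pipeline, the feature vector returned by `query_triplane` would be passed through a small decoder to obtain the density and color consumed by `composite`; here those values are supplied directly so that the two steps can be examined in isolation.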