GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time

This paper presents GGRt, a novel approach to generalizable novel view synthesis that reduces the reliance on real camera poses, the complexity of processing high-resolution images, and the need for lengthy optimization, thus making 3D Gaussian Splatting (3D-GS) more applicable in real-world scenarios. Specifically, we design a novel joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model. With the joint learning mechanism, the proposed framework estimates robust relative poses directly from image observations and thus largely alleviates the requirement for real camera poses. Moreover, we implement a deferred back-propagation mechanism that enables high-resolution training and inference, overcoming the resolution constraints of previous methods. To improve speed and efficiency, we further introduce a progressive Gaussian cache module that is dynamically adjusted during training and inference. As the first pose-free generalizable 3D-GS framework, GGRt achieves inference at $\ge$ 5 FPS and real-time rendering at $\ge$ 100 FPS. Through extensive experiments, we demonstrate that our method outperforms existing NeRF-based pose-free techniques in terms of inference speed and effectiveness. Its performance also approaches that of 3D-GS methods that rely on real camera poses. Our contributions provide a significant leap forward for the integration of computer vision and computer graphics into practical applications, offering state-of-the-art results on the LLFF, KITTI, and Waymo Open datasets and enabling real-time rendering for immersive experiences.
