Radiance field methods have achieved photorealistic novel view synthesis and geometry reconstruction, but they are mostly applied in per-scene optimization or small-baseline settings. Several recent works investigate feed-forward reconstruction with large baselines using transformers, yet they all rely on standard global attention and hence ignore the local nature of 3D reconstruction. We propose a method that unifies local and global reasoning within transformer layers, yielding improved quality and faster convergence. Our model represents scenes as Gaussian Volumes and combines this representation with an image encoder and Group Attention Layers for efficient feed-forward reconstruction. Experiments show that our model, trained for two days on four GPUs, reconstructs 360° radiance fields with high fidelity and remains robust under zero-shot and out-of-domain testing.
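To illustrate the core idea of unifying local and global reasoning, the sketch below contrasts group-restricted attention with standard global attention. This is a minimal NumPy illustration under assumed details: the grouping scheme, single-head attention, and the alternation of local and global layers are our assumptions for exposition, not the paper's exact Group Attention Layer.

```python
# Minimal sketch (NumPy) of alternating local (group) and global attention.
# The grouping scheme and layer interleaving here are illustrative
# assumptions, not the paper's exact Group Attention Layer.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the token axis.
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def global_attention(tokens):
    # Standard transformer layer: every token attends to every token.
    return attention(tokens, tokens, tokens)

def group_attention(tokens, num_groups):
    # Tokens are split into local groups and attention is computed
    # independently within each group, reflecting the local nature of
    # 3D reconstruction at a fraction of the global cost.
    n, d = tokens.shape
    groups = tokens.reshape(num_groups, n // num_groups, d)
    out = np.stack([attention(g, g, g) for g in groups])
    return out.reshape(n, d)

# Toy usage: 64 tokens of dim 8, alternating local and global reasoning.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))
x = group_attention(x, num_groups=8)   # local layer
x = global_attention(x)                # global layer
print(x.shape)  # (64, 8)
```

Restricting attention to groups reduces the quadratic cost of each local layer from O(n²) to O(n²/g) for g groups, which is consistent with the claimed efficiency and faster convergence, though the actual group construction is left to the paper's method section.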