GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

The efficient spatial allocation of primitives is the foundation of 3D Gaussian Splatting, as it directly governs the balance between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly because they rely on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation grows and global consistency becomes fragile. To address this, we introduce GlobalSplat, a framework built on the principle of "align first, decode later". Our approach learns a compact, global, latent scene representation that encodes the multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. By using a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat. On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while using as few as 16K Gaussians, significantly fewer than dense pipelines require, for a lightweight 4MB footprint. Further, GlobalSplat runs significantly faster than the baselines, completing inference in under 78 milliseconds in a single forward pass. The project page is available at https://r-itk.github.io/globalsplat/
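
The scaling argument above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the 256x256 input resolution, the view counts, the reading of "16K" as 16,384 and "4MB" as 4 MiB, and the helper names `pixel_aligned_count` and `bytes_per_gaussian` are all assumptions made for the example. It contrasts a pixel-aligned scheme, which emits one Gaussian per input pixel, with a fixed global budget.

```python
# Illustrative arithmetic only; resolutions, view counts, and unit readings are assumptions.


def pixel_aligned_count(num_views: int, height: int, width: int) -> int:
    """One Gaussian per input pixel, so the count grows linearly with the views."""
    return num_views * height * width


def bytes_per_gaussian(total_bytes: int, num_gaussians: int) -> float:
    """Average storage available per primitive for a given footprint."""
    return total_bytes / num_gaussians


GLOBAL_BUDGET = 16 * 1024          # "16K Gaussians", read here as 16,384 (assumption)
FOOTPRINT_BYTES = 4 * 1024 * 1024  # "4MB", read here as 4 MiB (assumption)

if __name__ == "__main__":
    for views in (2, 4, 8):
        # 256x256 inputs are an assumption, not a figure from the source.
        dense = pixel_aligned_count(views, 256, 256)
        ratio = dense / GLOBAL_BUDGET
        print(f"{views} views: {dense} pixel-aligned vs {GLOBAL_BUDGET} global ({ratio:.0f}x)")
    per_gaussian = bytes_per_gaussian(FOOTPRINT_BYTES, GLOBAL_BUDGET)
    print(f"{per_gaussian:.0f} bytes per Gaussian")
```

Under these assumed readings, two 256x256 views already yield 131,072 pixel-aligned primitives, eight times the fixed budget, and the gap widens linearly with each added view. The stated footprint works out to roughly 256 bytes per Gaussian, although the source does not say how those bytes are laid out.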
