In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby decoupling the number of predicted primitives from the input image resolution and the number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and a more balanced distribution of 3D Gaussians, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
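To make the contrast between the two mean parameterizations concrete, the following is a minimal NumPy sketch, not the TokenGS implementation. It assumes unit-length ray directions (so the regressed scalar is a distance along the ray), and it stands in for the decoder with random token features and a hypothetical linear head (`weight`, `bias`); the function names and all shapes are illustrative assumptions.

```python
import numpy as np


def means_from_ray_depths(origins, directions, depths):
    """Pixel-aligned parameterization: mean = origin + depth * direction.

    origins:    (V, 3)        camera centers, one per view
    directions: (V, H, W, 3)  unit ray directions, one per pixel
    depths:     (V, H, W)     regressed distance along each ray
    Returns (V, H, W, 3): exactly one Gaussian mean per input pixel.
    """
    return origins[:, None, None, :] + depths[..., None] * directions


def means_from_tokens(token_features, weight, bias):
    """Direct parameterization: map each token to an xyz mean.

    token_features: (N, D) stand-in for decoded Gaussian tokens
    weight, bias:   (D, 3), (3,) hypothetical linear output head
    Returns (N, 3): N is a free choice, independent of V, H, W.
    """
    return token_features @ weight + bias


rng = np.random.default_rng(0)
V, H, W, D, N = 2, 4, 6, 8, 16  # illustrative sizes only

origins = rng.normal(size=(V, 3))
directions = rng.normal(size=(V, H, W, 3))
directions /= np.linalg.norm(directions, axis=-1, keepdims=True)
depths = rng.uniform(0.5, 5.0, size=(V, H, W))

ray_means_grid = means_from_ray_depths(origins, directions, depths)
ray_means = ray_means_grid.reshape(-1, 3)

token_means = means_from_tokens(
    rng.normal(size=(N, D)), rng.normal(size=(D, 3)), np.zeros(3)
)

print(ray_means.shape)    # number of means is V * H * W
print(token_means.shape)  # number of means is N
```

In the ray-based variant each mean is constrained to its own camera ray, so the predicted positions depend directly on the supplied poses and the primitive count is fixed at `V * H * W`. In the token-based variant the means are unconstrained 3D points and their count is set by the number of tokens.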