Feed-forward surround-view scene reconstruction for autonomous driving offers fast, generalizable inference, but it faces the core challenge of preserving generalization while improving novel-view quality. Because surround-view cameras share only minimal overlap, existing methods typically fail to ensure geometric consistency and reconstruction quality at novel views. To resolve this tension, we argue that geometric information must be learned explicitly and that the resulting features should guide the improvement of semantic quality in novel views. In this paper, we introduce \textbf{Visual Gaussian Driving (VGD)}, a novel feed-forward, end-to-end learning framework designed to address this challenge. To achieve generalizable geometric estimation, we design a lightweight variant of the VGGT architecture and efficiently distill geometric priors from the pre-trained VGGT into its geometry branch. Furthermore, we design a Gaussian Head that fuses multi-scale geometry tokens to predict Gaussian parameters for novel-view rendering, sharing the same patch backbone as the geometry branch. Finally, we integrate multi-scale features from both the geometry and Gaussian Head branches to jointly supervise a semantic refinement model, optimizing rendering quality through feature-consistent learning. Experiments on nuScenes demonstrate that our approach significantly outperforms state-of-the-art methods in both objective metrics and subjective quality across various settings, validating VGD's scalability and high-fidelity surround-view reconstruction.
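To make the Gaussian Head description concrete, the following is a minimal PyTorch-style sketch of a head that fuses multi-scale geometry tokens into per-patch Gaussian parameters. The module names, channel widths, and the exact parameter layout (mean offset, scale, rotation quaternion, opacity, spherical-harmonic color) are illustrative assumptions, not VGD's actual implementation.

\begin{verbatim}
# Illustrative sketch only: fuses multi-scale geometry tokens and predicts
# per-patch 3D Gaussian parameters. Channel sizes and layout are assumptions.
import torch
import torch.nn as nn


class GaussianHeadSketch(nn.Module):
    def __init__(self, token_dims=(256, 512, 1024), hidden=256, sh_degree=0):
        super().__init__()
        # Project tokens from each geometry scale to a common width before fusion.
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in token_dims)
        # Per-Gaussian parameters: 3 (mean offset) + 3 (scale) + 4 (rotation)
        # + 1 (opacity) + 3 * (sh_degree + 1) ** 2 (SH color coefficients).
        out_dim = 3 + 3 + 4 + 1 + 3 * (sh_degree + 1) ** 2
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, multi_scale_tokens):
        # multi_scale_tokens: list of (B, N, C_i) maps sharing patch count N.
        fused = sum(p(t) for p, t in zip(self.proj, multi_scale_tokens))
        raw = self.mlp(fused)
        means, scales, rots, opacity, sh = torch.split(
            raw, [3, 3, 4, 1, raw.shape[-1] - 11], dim=-1
        )
        return {
            "means": means,                        # offsets to unprojected patch centers
            "scales": torch.exp(scales),           # positivity via exponential
            "rotations": nn.functional.normalize(rots, dim=-1),  # unit quaternions
            "opacity": torch.sigmoid(opacity),
            "sh": sh,
        }


if __name__ == "__main__":
    B, N = 2, 196  # e.g. surround-view patch tokens flattened per batch item
    tokens = [torch.randn(B, N, d) for d in (256, 512, 1024)]
    gaussians = GaussianHeadSketch()(tokens)
    print({k: v.shape for k, v in gaussians.items()})
\end{verbatim}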