Prior works employing pixel-based Gaussian representations have demonstrated their efficacy in feed-forward sparse-view reconstruction. However, such representations necessitate cross-view overlap for accurate depth estimation and are challenged by object occlusions and frustum truncations. As a result, these methods require scene-centric data acquisition to maintain cross-view overlap and complete scene visibility to circumvent occlusions and truncations, which restricts their applicability to scene-centric reconstruction. In contrast, in autonomous driving scenarios, a more practical paradigm is ego-centric reconstruction, which is characterized by minimal cross-view overlap and frequent occlusions and truncations. The limitations of pixel-based representations thus hinder the utility of prior works in this task. In light of this, this paper conducts an in-depth analysis of different representations and introduces the Omni-Gaussian representation with a tailored network design to complement their strengths and mitigate their drawbacks. Experiments show that our method significantly surpasses state-of-the-art methods, pixelSplat and MVSplat, in ego-centric reconstruction, and achieves comparable performance to prior works in scene-centric reconstruction. Furthermore, we extend our method with diffusion models, pioneering feed-forward multi-modal generation of 3D driving scenes.