Bird's-Eye-View (BEV) perception is a cornerstone of autonomous driving: it offers a unified spatial representation that fuses surround-view images and supports downstream tasks such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm in which image features are transformed directly into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box that often lacks explicit 3D geometric understanding and interpretability, which leads to suboptimal performance. In this paper, we argue that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, which enables the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.
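To make the projection step concrete, the following is a minimal sketch of how features attached to 3D Gaussians could be pooled into a BEV grid. It is an illustration under stated assumptions, not the projection used by Splat2BEV: the function name `splat_to_bev`, the grid extents, the cell size, and the opacity-weighted averaging are all hypothetical, and each Gaussian is reduced to its center (covariances are ignored and height is simply collapsed).

```python
import numpy as np


def splat_to_bev(centers, feats, opacity,
                 x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.5):
    """Pool per-Gaussian features into a BEV grid (hypothetical helper).

    centers: (N, 3) Gaussian means in the ego frame, in metres.
    feats:   (N, C) feature vector attached to each Gaussian.
    opacity: (N,)   per-Gaussian weight in [0, 1].
    Returns an (H, W, C) BEV feature map; empty cells are zero.
    """
    W = int(round((x_range[1] - x_range[0]) / cell))
    H = int(round((y_range[1] - y_range[0]) / cell))

    # Map x/y coordinates to integer cell indices; z is dropped (height collapse).
    ix = np.floor((centers[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((centers[:, 1] - y_range[0]) / cell).astype(int)

    # Discard Gaussians that fall outside the BEV extent.
    keep = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
    flat = iy[keep] * W + ix[keep]
    w = opacity[keep]

    # Accumulate opacity-weighted features and weights per cell.
    # np.add.at handles repeated indices correctly (unbuffered add).
    acc = np.zeros((H * W, feats.shape[1]))
    wsum = np.zeros(H * W)
    np.add.at(acc, flat, feats[keep] * w[:, None])
    np.add.at(wsum, flat, w)

    # Normalise to a weighted mean; the clamp avoids division by zero.
    bev = acc / np.maximum(wsum, 1e-8)[:, None]
    return bev.reshape(H, W, feats.shape[1])
```

The resulting `(H, W, C)` map would then be fed to a task head (segmentation, detection, or motion prediction). A fuller version would rasterize each Gaussian's footprint using its covariance rather than only its center.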