3D occupancy, an advanced perception technology for driving scenarios, quantifies the physical space into a grid map and thereby represents the entire scene without distinguishing between foreground and background. The widely adopted projection-first deformable attention, though efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose a learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we demonstrate the scalability of our view attention across diverse multi-view 3D tasks, including map construction and 3D object detection. Leveraging the proposed view attention together with an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential of representing fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The code is available at \url{https://github.com/ViewFormerOcc/ViewFormer-Occ}.