Human drivers can easily describe a complex traffic scene using their visual system. This ability to perceive precisely is essential to a driver's planning. To achieve this, a geometry-aware representation that quantizes the physical 3D scene into a structured grid map with a semantic label per cell, termed 3D Occupancy, would be desirable. Compared to bounding boxes, a key insight behind occupancy is that it can capture the fine-grained details of critical obstacles in the scene and thereby facilitate subsequent tasks. Prior or concurrent literature mainly concentrates on a single scene completion task, whereas we argue that the occupancy representation has the potential for broader impact. In this paper, we propose OccNet, a multi-view vision-centric pipeline with a cascade and temporal voxel decoder to reconstruct 3D occupancy. At the core of OccNet is a general occupancy embedding that represents the 3D physical world. Such a descriptor can be applied to a wide span of driving tasks, including detection, segmentation, and planning. To validate the effectiveness of this new representation and our proposed algorithm, we propose OpenOcc, the first dense high-quality 3D occupancy benchmark built on top of nuScenes. Empirical experiments show evident performance gains across multiple tasks; e.g., motion planning witnesses a collision rate reduction of 15%-58%, demonstrating the superiority of our method.
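To make the occupancy representation concrete, the following is a minimal sketch of quantizing a 3D scene into a structured grid with one semantic label per cell. It is an illustration only, not the OccNet pipeline: OccNet reconstructs occupancy from multi-view images, whereas this sketch assumes that labeled 3D points are already available. The function name `voxelize`, the `FREE` label convention, and all parameter names are hypothetical.

```python
import numpy as np

FREE = 0  # assumed label for empty cells (hypothetical convention)


def voxelize(points, labels, range_min, range_max, voxel_size):
    """Quantize labeled 3D points into a dense semantic occupancy grid."""
    points = np.asarray(points, dtype=np.float64)
    labels = np.asarray(labels)
    range_min = np.asarray(range_min, dtype=np.float64)
    range_max = np.asarray(range_max, dtype=np.float64)

    # Number of cells along x, y, z.
    dims = np.round((range_max - range_min) / voxel_size).astype(int)
    grid = np.full(tuple(dims), FREE, dtype=np.int64)

    # Integer cell index of every point.
    idx = np.floor((points - range_min) / voxel_size).astype(int)

    # Keep only the points that fall inside the grid.
    inside = np.all((idx >= 0) & (idx < dims), axis=1)
    idx, labels = idx[inside], labels[inside]

    # If several points share a cell, one of their labels is kept
    # (which one is unspecified); a majority vote would be more careful.
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    return grid


if __name__ == "__main__":
    # Three toy points; the third lies outside the 2 m x 2 m x 1 m volume.
    pts = np.array([[0.2, 0.2, 0.2], [1.5, 0.5, 0.5], [5.0, 0.0, 0.0]])
    occ = voxelize(pts, [1, 2, 3], (0, 0, 0), (2, 2, 1), voxel_size=1.0)
    print(occ.shape, int((occ != FREE).sum()))
```

Unlike a bounding box, each occupied cell carries its own label, so irregular obstacle shapes are preserved up to the chosen voxel size.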