Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Although BEV is more efficient than a voxel representation, a single plane has difficulty describing the fine-grained 3D structure of a scene. To address this, we propose a tri-perspective view (TPV) representation, which accompanies BEV with two additional perpendicular planes. We model each point in 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model, trained with sparse supervision, effectively predicts the semantic occupancy of all voxels. We demonstrate for the first time that using only camera inputs can achieve performance comparable to that of LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.
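The summation of projected plane features described above can be illustrated with a minimal NumPy sketch. It is not taken from the TPVFormer code base: the function names, the plane layouts `(H, W, C)`, `(D, H, C)`, and `(W, D, C)`, and the use of integer voxel indices (instead of interpolated sampling at continuous coordinates) are all assumptions made for illustration.

```python
import numpy as np


def tpv_point_feature(t_hw, t_dh, t_wd, h, w, d):
    """Feature of the voxel at integer indices (h, w, d).

    Assumed (hypothetical) plane layouts:
      t_hw: (H, W, C) top plane, i.e. the BEV plane
      t_dh: (D, H, C) side plane
      t_wd: (W, D, C) front plane
    The point is projected onto each plane by dropping one coordinate,
    and the three looked-up feature vectors are summed.
    """
    return t_hw[h, w] + t_dh[d, h] + t_wd[w, d]


def tpv_to_voxels(t_hw, t_dh, t_wd):
    """Expand the three planes into a dense (H, W, D, C) feature grid.

    Each plane is broadcast along the axis it lacks, so every voxel
    receives the sum of its three projected features.
    """
    hw = t_hw[:, :, None, :]                       # (H, W, 1, C)
    dh = t_dh.transpose(1, 0, 2)[:, None, :, :]    # (H, 1, D, C)
    wd = t_wd[None, :, :, :]                       # (1, W, D, C)
    return hw + dh + wd


def tpv_storage(H, W, D, C):
    """Number of stored feature values: (three planes, dense voxel grid)."""
    return (H * W + D * H + W * D) * C, H * W * D * C
```

Under these assumptions, three planes store (HW + DH + WD)·C values, while a dense voxel grid stores HWD·C, yet any voxel feature can still be queried. For example, `tpv_storage(100, 100, 8, 64)` compares the two counts for a hypothetical grid size.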