4D panoptic segmentation is a challenging but practically useful task that requires every point in a LiDAR point-cloud sequence to be assigned a semantic class label, and individual objects to be segmented and tracked over time. Existing approaches use only LiDAR inputs, which convey limited information in regions where points are sparse. This problem can, however, be mitigated by also using RGB camera images, which offer appearance-based information that can reinforce the geometry-based LiDAR features. Motivated by this, we propose 4D-Former: a novel method for 4D panoptic segmentation that leverages both LiDAR and image modalities, and predicts semantic masks as well as temporally consistent object masks for the input point-cloud sequence. We encode semantic classes and objects using a set of concise queries that absorb feature information from both data modalities. Additionally, we propose a learned mechanism for associating object tracks over time that reasons over both appearance and spatial location. We apply 4D-Former to the nuScenes and SemanticKITTI datasets, where it achieves state-of-the-art results.
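The abstract does not say how the learned track association works internally. As a rough illustration only, the sketch below scores every (track, detection) pair with a hand-set logistic function of appearance similarity (cosine similarity of embeddings) and spatial proximity (centroid distance), then greedily matches the highest-scoring pairs and starts new tracks for unmatched detections. The weights, threshold, greedy matching, and all names (`Instance`, `association_score`, `associate`) are hypothetical assumptions; this is not 4D-Former's method, which learns its association scoring rather than fixing it by hand.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Instance:
    # Hypothetical per-object record: an appearance embedding (e.g. a fused
    # LiDAR+image feature) and the 3D centroid of the object's points.
    embedding: Sequence[float]
    centroid: Tuple[float, float, float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def sigmoid(z: float) -> float:
    # Numerically stable logistic function (avoids overflow in math.exp).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def association_score(track: Instance, det: Instance,
                      w_app: float = 4.0, w_dist: float = 1.0,
                      bias: float = -1.0) -> float:
    # Hand-set stand-in for a learned scorer (assumed weights): similar
    # appearance raises the score, larger centroid distance lowers it.
    dist = math.dist(track.centroid, det.centroid)
    return sigmoid(w_app * cosine(track.embedding, det.embedding)
                   - w_dist * dist + bias)


def associate(tracks: Dict[int, Instance], dets: List[Instance],
              next_id: int, threshold: float = 0.5) -> Tuple[List[int], int]:
    # Score every (track, detection) pair and visit pairs best-first.
    pairs = sorted(
        ((association_score(t, d), tid, j)
         for tid, t in tracks.items() for j, d in enumerate(dets)),
        reverse=True,
    )
    assigned: Dict[int, int] = {}
    used_tracks = set()
    for score, tid, j in pairs:
        if score < threshold:
            break  # all remaining pairs score even lower
        if tid in used_tracks or j in assigned:
            continue  # each track and detection is matched at most once
        assigned[j] = tid
        used_tracks.add(tid)
    # Unmatched detections start new tracks with fresh IDs.
    ids = []
    for j in range(len(dets)):
        if j not in assigned:
            assigned[j] = next_id
            next_id += 1
        ids.append(assigned[j])
    return ids, next_id


# Usage: two existing tracks, three detections in the next frame.
tracks = {0: Instance([1.0, 0.0, 0.0], (0.0, 0.0, 0.0)),
          1: Instance([0.0, 1.0, 0.0], (10.0, 0.0, 0.0))}
dets = [Instance([0.0, 1.0, 0.0], (10.5, 0.0, 0.0)),
        Instance([1.0, 0.0, 0.0], (0.3, 0.0, 0.0)),
        Instance([0.0, 0.0, 1.0], (50.0, 0.0, 0.0))]
ids, next_id = associate(tracks, dets, next_id=2)
print(ids, next_id)  # [1, 0, 2] 3
```

Greedy best-first matching keeps the sketch short; an optimal one-to-one assignment (for example `scipy.optimize.linear_sum_assignment` on negated scores) is a common alternative. Combining both cues matters in the example: the two nearby detections are matched because appearance and position agree, while the distant detection with an unfamiliar embedding starts a new track.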