Depth-aware video panoptic segmentation is a promising approach to camera-based scene understanding. However, current state-of-the-art methods require costly video annotations and use a more complex training pipeline than their image-based equivalents. In this paper, we present a new approach, Unified Perception, that achieves state-of-the-art performance without requiring video-based training. Our method employs a simple two-stage cascaded tracking algorithm that (re)uses object embeddings computed in an image-based network. Experimental results on the Cityscapes-DVPS dataset demonstrate that our method achieves an overall DVPQ of 57.1, surpassing state-of-the-art methods. Furthermore, we show that our tracking strategies are effective for long-term object association on KITTI-STEP, achieving an STQ of 59.1, which exceeds the performance of state-of-the-art methods that employ the same backbone network.
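To make the idea of a two-stage cascaded tracker over image-level object embeddings concrete, the following is a minimal sketch and not the implementation described in this paper. The abstract does not specify what each stage does, so the sketch assumes that the first stage matches detections against recently seen ("active") tracks and the second stage matches the leftovers against older ("lost") tracks. The function names, the cosine-similarity cost, the greedy matching, and both thresholds are hypothetical choices made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def greedy_match(tracks, detections, threshold):
    """Greedily pair tracks and detections by descending similarity.

    tracks:     dict mapping integer track id -> embedding
    detections: dict mapping detection index -> embedding
    Returns a dict mapping detection index -> track id.
    """
    pairs = sorted(
        (
            (cosine(t_emb, d_emb), t_id, d_idx)
            for t_id, t_emb in tracks.items()
            for d_idx, d_emb in detections.items()
        ),
        reverse=True,
    )
    matches, used_tracks = {}, set()
    for sim, t_id, d_idx in pairs:
        if sim < threshold:
            break  # all remaining pairs are even less similar
        if t_id in used_tracks or d_idx in matches:
            continue
        matches[d_idx] = t_id
        used_tracks.add(t_id)
    return matches


def cascaded_associate(active, lost, detections, thr_active=0.7, thr_lost=0.5):
    """Hypothetical two-stage cascade (assumed design, not the paper's).

    Stage 1 matches detections to active tracks; stage 2 matches only the
    detections left over from stage 1 to lost tracks. Both thresholds are
    illustrative values, not values taken from the paper.
    """
    dets = dict(enumerate(detections))
    stage1 = greedy_match(active, dets, thr_active)
    remaining = {i: e for i, e in dets.items() if i not in stage1}
    stage2 = greedy_match(lost, remaining, thr_lost)
    assignments = {**stage1, **stage2}
    unmatched = [i for i in dets if i not in assignments]
    return assignments, unmatched


if __name__ == "__main__":
    active = {1: [1.0, 0.0]}
    lost = {2: [0.0, 1.0]}
    detections = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]
    print(cascaded_associate(active, lost, detections))
```

In this sketch, unmatched detections would start new tracks. The greedy matcher keeps the example dependency-free; an optimal assignment solver such as the Hungarian algorithm could be substituted without changing the cascade structure.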