To address video panoptic segmentation in the wild, we propose a robust integrated video panoptic segmentation solution. We cast the task as segmentation target querying: both semantic and instance targets are represented as a set of queries, and these queries are combined with video features extracted by neural networks to predict segmentation masks. To improve learning accuracy and convergence speed, we add video semantic segmentation and video instance segmentation as additional tasks for joint training. We further add an image semantic segmentation model to improve performance on semantic classes, together with several additional operations that improve the robustness of the model. Extensive experiments on the VIPSeg dataset show that the proposed solution achieves state-of-the-art performance with 50.04\% VPQ on the VIPSeg test set, ranking 3rd in the video panoptic segmentation track of the PVUW Challenge 2023.
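The query formulation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how queries and video features are combined, so the dot-product decoder, the per-pixel argmax, and all names below (`predict_mask_logits`, `assign_pixels`, the toy shapes) are assumptions made for illustration, not the method used in the proposed solution.

```python
# Illustrative sketch only: the dot-product decoder, the argmax assignment,
# and every name here are assumptions, not the authors' implementation.

def predict_mask_logits(queries, video_features):
    """Return logits[q][t][y][x] = dot(queries[q], video_features[t][y][x])."""
    return [
        [
            [
                [sum(qc * fc for qc, fc in zip(query, pixel)) for pixel in row]
                for row in frame
            ]
            for frame in video_features
        ]
        for query in queries
    ]


def assign_pixels(mask_logits):
    """Label each pixel with the index of the query that scores highest there."""
    num_queries = len(mask_logits)
    frames = len(mask_logits[0])
    height = len(mask_logits[0][0])
    width = len(mask_logits[0][0][0])
    return [
        [
            [
                max(range(num_queries), key=lambda q: mask_logits[q][t][y][x])
                for x in range(width)
            ]
            for y in range(height)
        ]
        for t in range(frames)
    ]


# Toy input: 2 queries, 2 frames of 1x2 pixels, 2 feature channels.
queries = [[1.0, 0.0], [0.0, 1.0]]
video_features = [
    [[[0.9, 0.1], [0.2, 0.8]]],
    [[[0.7, 0.3], [0.1, 0.6]]],
]
mask_logits = predict_mask_logits(queries, video_features)
segmentation = assign_pixels(mask_logits)
```

In this toy setting each query yields one mask per frame, so a single query index refers to the same semantic or instance target across the whole clip, which is what lets one set of queries serve both kinds of target.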