Video scene parsing incorporates temporal information, which can enhance the consistency and accuracy of predictions compared to image scene parsing. The added temporal dimension enables a more comprehensive understanding of the scene, leading to more reliable results. This paper presents the winning solution of the CVPR2023 workshop for video semantic segmentation, focusing on enhancing Spatial-Temporal correlations with contrastive loss. We also explore the influence of multi-dataset training by utilizing a label-mapping technique. And the final result is aggregating the output of the above two models. Our approach achieves 65.95% mIoU performance on the VSPW dataset, ranked 1st place on the VSPW challenge at CVPR 2023.
翻译:视频场景解析融入了时间信息,与图像场景解析相比能够提升预测的一致性和准确性。增加的时间维度有助于更全面地理解场景,从而获得更可靠的结果。本文介绍了CVPR2023研讨会视频语义分割赛道的冠军方案,重点研究了通过对比损失增强时空关联性。同时,我们利用标签映射技术探索了多样本训练的影响,最终结果由上述两个模型的输出聚合而成。所提方法在VSPW数据集上达到了65.95%的mIoU性能,在CVPR 2023 VSPW挑战赛中排名第一。